1 Introduction

Chemistry is the central science: organic synthesis, drug discovery and analytical techniques are the major domains that are utilizing artificial intelligence methods such as chemometrics and machine learning, resulting in a major transformation. Artificial intelligence (AI) garnered the attention of chemists in the early 1950s, yet at that time computer-based learning was obscure, even esoteric, as a means of solving chemistry problems. This situation, however, did not persist for long. Over centuries, chemists have amassed vast collections of chemical structure data by performing experiments. Chemometrics emerged to demonstrate how computers could be used in chemistry to solve complex problems. Massart et al. (1997) defined chemometrics as a "chemical discipline that uses mathematics, statistics, and formal logic (a) to design or select optimal experimental procedures; (b) to provide maximum relevant chemical information by analysing chemical data; and (c) to obtain knowledge about chemical systems." In 1975, a seminal paper featured 'chemometrics' in its title, bringing forward the novel idea of utilizing computing tools to study complex chemical data (Kowalski 1975). In 1977, Analytica Chimica Acta introduced a section to communicate the developing area of chemometrics, particularly computer-assisted analysis of chromatography, UV, IR, 13C-NMR, and mass spectrometric data (Clerc and Ziegler 1977). The section was devoted to pioneering work on the NIPALS algorithm for principal component analysis and the SIMCA and KNN algorithms for pattern recognition. Hence, chemometrics was primarily applied to pattern recognition, influenced by two approaches, viz.

  (a) kernel methods, machine learning, self-organizing maps, and support vector machines;

  (b) statistical methods such as discriminant analysis, method validation, and Bayesian models.

In a strict sense, chemometrics is mathematical and statistical computer-based modelling utilized for optimizing methods and extracting results from analytical data. It was only from 1988 that the term "machine learning" made its debut in the titles of the chemical literature (Appel et al. 1988; Gelernter et al. 1990; Sternberg et al. 1992; Salin and Winston 1992), and it has been in use ever since. In essence, chemometrics and machine learning have a fine distinction: the former relies on linear relationships in data, while the latter deals with large and non-linear datasets. Machine learning involves training algorithms with chemical data, allowing them to learn by example. A trained machine learning model is then deployed to deliver intelligent decisions. This necessitates good-quality data if machine learning models are to navigate chemical problems successfully.

Today, chemists are consistently exploiting ML and chemometrics to solve challenging problems. This upsurge became apparent when Baum et al. (2021) reported the rise in journal articles and patents featuring AI-based methods in chemistry. This increased interest, together with the hype of a rapid march towards chemical automation, led to the genesis of this review. The present article describes the utilization of chemometrics and ML in chemistry, particularly in organic synthesis and analytical chemistry. It discusses the expert systems used in organic synthesis, covering the earliest attempts at retrosynthesis and current ML methods applied to organic synthesis with chosen examples. Next, we describe the reported literature showcasing efforts undertaken by medicinal chemists for COVID-19 therapy. Further, the progress of ML techniques applied to spectroscopy, microscopy and chromatography is presented. Given the interdisciplinary nature of this review, dialogue between chemists, computer scientists and mathematicians may lead to better investigations and to unravelling mysteries of the chemical world. We also attempt to address some tough questions on the current state of ML-based methods in chemistry. Rather than focusing on one particular domain, the present review addresses selected domains of chemistry so as to bring out the divergent roles of AI as a whole.

2 Pacing organic synthesis with machine learning

Chemical space is a conceptual area that contains all possible chemical entities. Lipinski and Hopkins (2004) envisioned that there are about 10^180 possible molecules, of which about 10^60 are small organic molecules. Organic chemists delve into this chemical space in search of novel drug molecules. Given the sheer number of possible molecules, the search for novel ones is challenging as a purely human endeavour, bringing machine learning techniques to the fore as an attractive technology.

Chemistry is a new language to be learnt by machines, which can then efficiently predict organic synthesis routes at a faster pace. Before we delve into machine learning methods, it is essential to describe earlier attempts to study and predict organic reaction outcomes. Lederberg (1964) made one of the earliest attempts at an intelligent system in chemistry, the DENDRAL project, which assisted chemists in identifying organic molecules from MS data. DENDRAL has been considered a pioneering expert system that automated the problem-solving tasks of synthetic chemists. It was coded in INTERLISP and comprised heuristic-DENDRAL and meta-DENDRAL modules. The heuristic-DENDRAL expert system worked on a 'Plan-Generate-Test' sequence for organic structure elucidation using MS data. The meta-DENDRAL module predicted the correct spectral data of novel molecules using chemistry rules. DENDRAL served as a precursor for later expert systems in chemistry, and the pioneering work of Elias J. Corey on developing a knowledgebase of organic reactions led to retrosynthesis and its computing tools (Corey 1967). We also come across the seminal work of Dugundji and Ugi (1973), who conceptualized an algebraic matrix model called FIEM for understanding organic synthesis and mechanisms. In the following sections, the retrosynthetic tools developed so far are discussed to trace their growth in synthesis planning.

2.1 Solving the maze of organic synthesis using retrosynthesis

The journey of organic synthesis dates back about 200 years to when Wöhler (1828) prepared urea and oxalic acid. A typical problem in organic synthesis is the structural description of the molecule to be prepared, called the target organic molecule (TOM). TOMs are compounds with important properties, such as promising therapeutic agents or industrially important intermediates.

Routinely, synthetic routes were devised by chemists through innate retrosynthesis, primarily a pen-and-paper method in which chemists hand-draw pathways based on general chemistry rules and their intuition. Retrosynthesis is a conceptual problem-solving strategy that transforms the TOM into simpler starting materials, allowing a feasible organic synthetic route to the original target molecule to be traced (Fig. 1).

Fig. 1

a Schematic representation of retrosynthesis; DIS means disconnection. b Retrosynthesis of gabapentin via whole-molecule or FGI strategies. There may be more ways of interconverting the FGs of gabapentin than those depicted in the figure

A synthetic chemist works backwards from the TOM by assuming possible disconnections of its chemical bonds. These disconnections generate synthons: fragments that are usually unstable species such as ions or radicals. These disconnections are not real bond-breaking steps; rather, they are the mental foresight of the chemist based on general rules. Retrosynthesis is conceived in two forms, namely target-oriented (whole-molecule) and functional group interconversion (FGI), illustrated here with the example of gabapentin (the TOM), an anticonvulsant drug. While it is axiomatic to predict synthons for gabapentin via the whole-molecule retrosynthetic strategy, it is FGI that offers wider options for arriving at possible starting reactants (Santos and Heggie 2020). In Fig. 1b, either 1-methylenecyclohexane or cyclohexanone is a potential starting material for synthesizing gabapentin. 1-Methylenecyclohexane is problematic and costly, whereas cyclohexanone is toxic and an irritant. Hence, chemists must make choices with certain trade-offs. It is advisable to select synthetic routes on the basis of reagent availability, cost, and fewer reaction steps. In the case of gabapentin, one may either choose cyclohexanone or an entirely different class of organic reaction that yields fewer reaction steps and less toxic starting materials.

If the TOM is a complex entity, there is a greater chance of numerous distinct synthetic routes to prepare it. One can have over 10^18 feasible one-step reaction routes to a target molecule. This led Corey and Wipke (1969) to propose a logic-oriented computer approach referred to as synthesis tree search. In this approach, organic reactions are viewed as an AND/OR tree, where the tree descends from the TOM (the goal node) to the terminal nodes, which correspond to simpler molecular fragments. These molecular fragments are the possible starting molecules; the branches connecting the goal and terminal nodes are the organic reactions. Put simply, the AND nodes and OR nodes of the decision tree refer to organic reactions and molecules, respectively. Figure 2 depicts the FGI scheme of the AND/OR synthesis tree of gabapentin (a minimal search sketch over such a tree follows the figure). Corey's seminal work was devoted to developing retrosynthetic tools that shifted chemists' focus from intuition-based strategies to logic-based instructions (Corey et al. 1985). The earliest known retrosynthetic tool for assisting chemists was the LHASA program (Pensak and Corey 1977). LHASA was based on a sixfold strategy for retrosynthesis, viz. transform, mechanistic transform, structure goal, stereochemical, topological and FG strategies. It used a special language, CHMTRN, constructed to note and search for disconnections.

Fig. 2

AND/OR synthesis tree representation of the FGI scheme of gabapentin. The AND nodes are the organic reactions and the OR nodes refer to molecules. There can be more AND/OR nodes than those shown in the figure. The dashed boxes and lines of D, E and F refer to different chemical routes generating starting materials other than those depicted in the retrosynthetic scheme
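To make the AND/OR formulation concrete, the sketch below implements a recursive goal-to-terminal search over a toy retrosynthetic knowledgebase: a target is solved if any disconnection (OR branch) succeeds, and a disconnection succeeds only if all of its precursors (AND branch) are solvable. The rules, intermediate names and starting-material set are invented for illustration and do not come from any published reaction library.

```python
# A minimal AND/OR tree search sketch for retrosynthesis over a toy,
# hypothetical knowledgebase loosely inspired by the gabapentin example.

STARTING_MATERIALS = {"cyclohexanone", "nitromethane", "ethyl cyanoacetate"}

# OR level: each target maps to alternative disconnections (reactions);
# AND level: each reaction lists ALL precursors that must be solvable.
RETRO_RULES = {
    "gabapentin": [
        ["nitro-intermediate"],        # hypothetical route A
        ["cyano-ester-intermediate"],  # hypothetical route B
    ],
    "nitro-intermediate": [["cyclohexanone", "nitromethane"]],
    "cyano-ester-intermediate": [["cyclohexanone", "ethyl cyanoacetate"]],
}

def solve(target, depth=0, max_depth=5):
    """Return a nested synthesis plan, or None if no route is found."""
    if target in STARTING_MATERIALS:
        return target                      # terminal node: purchasable
    if depth >= max_depth or target not in RETRO_RULES:
        return None
    for precursors in RETRO_RULES[target]:      # try each OR branch
        plan = [solve(p, depth + 1, max_depth) for p in precursors]
        if all(plan):                           # AND: every precursor solved
            return {target: plan}
    return None

print(solve("gabapentin"))
```

Real systems replace the dictionary with tens of thousands of transforms and rank the OR branches with scoring functions, but the goal/terminal structure of the search is the same.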

Following LHASA, almost all retrosynthetic tools worked on these sixfold strategies for synthesis planning. The CASP (Salatin and Jorgensen 1980) and CAMEO (Jorgensen et al. 1990) expert systems assisted chemists in finding feasible synthetic routes and predicting products, respectively.

The provision of a graphical knowledgebase editor was an innovative step in CASP that allowed communication with the chemist. CAMEO was a forward prediction tool with selected rules on nucleophilicity, pKa, the nature of leaving groups and steric effects, which were used to rank the chemical reactions a target molecule undergoes during substitution. The forward prediction programs SOPHIA (Satoh and Funatsu 1995) and EROS (Gasteiger and Jochum 1978) assisted chemists in identifying active functional groups using reactivity rules and calculations. Ellerman et al. (1997) reported the COSYMA program, which searched for FGIs and protecting/deprotecting groups. With advancing computing power, the task moved from number crunching to logic and reasoning, leading to the SYNCHEM and SYNCHEM2 programs. In SYNCHEM, the initial stage involved the chemist's choice of synthetic strategies to be tried out, called 'synthemes.' Each syntheme had its own set of transforms that led to retrosynthetic routes, and the resultant precursors were assessed and ranked. The higher-ranked precursors were processed further, leading to a search for suitable materials in SYNCHEM's reaction library (Gelernter et al. 1977). Benstock et al. (1988) later included stereochemistry, which led to the development of the SYNCHEM2 program. Chemical structures were entered into SYNCHEM using the WLN representation, whereas SYNCHEM2 used the linear SLING representation. Mehta et al. (1998) reported the SESAM program, which utilized a backtracking algorithm to determine suitable starting materials for the target molecule. Hanessian et al. (1990) demonstrated the CHIRON database search of synthetic routes to stereochemical compounds, which obtains starting materials showing maximum overlap of carbon skeleton, FGs and stereochemistry.

Many programs aided retrosynthetic planning, such as PASCOP (Choplin et al. 1978), RETROSYN (Blurock 1990), WODCA (Gasteiger and Ihlenfeldt 1990), KOSP (Satoh and Funatsu 1999), ROBIA (Socorro et al. 2005), and CAOSP (Bersohn 1972; Tanaka et al. 2010), each being either a retrosynthetic or a forward prediction program. It is evident that synthetic chemists were already utilizing computer-aided retrosynthesis, and that incorporating machine learning into organic synthesis promises an evolutionary change in proposing forward syntheses. Post-LHASA, many expert systems enabled automation for planning multistep synthesis in chemistry laboratories. However, they could visualize only one step at a time for simpler target organic molecules, an impediment to their application in multistep natural product syntheses. Table 1 lists the current programs that assist chemists in selecting novel routes of organic synthesis.

Table 1 Features of present computer programs that support organic synthesis planning

As computing capacity kept increasing, algorithms needed improvement, and organic reaction forward prediction programs came to be framed in two modalities, viz. template-based and template-free methods, both of which have been trialled on chemical prediction tasks. Template-based methods are rule-based, with reaction libraries and scoring functions; these were discussed earlier in this section. The template-based approach may be a good starting point, but the basic premise of generating and extracting algorithms from set templates can introduce bias into the data, as it largely relies on chemists' intuition. The template-free approach mitigates this bias and includes utilizing NNs and seq2seq models. A minimal sketch of template-based forward prediction follows.
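As an illustration of the template-based modality, the following sketch applies a hand-written amide-coupling reaction SMARTS with RDKit; the template and reactants are illustrative choices of the present authors, not entries from any of the programs discussed above.

```python
# A minimal template-based forward prediction sketch using RDKit; the
# amide-coupling template below is an illustrative, hand-written rule.
from rdkit import Chem
from rdkit.Chem import AllChem

# Template: carboxylic acid + primary amine -> amide
# (atom maps carry atoms from the reactant patterns into the product)
template = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OH].[NH2:3][C:4]>>[C:1](=[O:2])[NH:3][C:4]"
)

acid = Chem.MolFromSmiles("CC(=O)O")        # acetic acid
amine = Chem.MolFromSmiles("NCc1ccccc1")    # benzylamine

for products in template.RunReactants((acid, amine)):
    product = products[0]
    Chem.SanitizeMol(product)               # products need sanitization
    print(Chem.MolToSmiles(product))        # -> CC(=O)NCc1ccccc1
```

A full template-based planner would apply thousands of such rules and rank the outcomes with a scoring function, which is precisely where intuition-derived bias can enter.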

Nam and Kim (2016) pioneered neural machine translation for predicting reactions from a patent dataset and Wade's Organic Chemistry textbook. They trained their model with patent reactions from US applications spanning 2001 to 2013 and with 75 reactions for five different starting molecules given as text problems in Wade's book. Liu et al. (2017) pioneered a data-driven model that learnt reaction predictions via seq2seq recurrent NNs trained with 50,000 experiments from the US patent literature using the SMILES text representation; such models first tokenize reaction SMILES, as sketched below.
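As a concrete glimpse of how such seq2seq models consume reactions, the sketch below tokenizes a reaction SMILES string with a regular expression of the kind popularised in this literature; multi-character tokens such as Cl, Br and bracketed atoms must survive as single units. The example reaction is illustrative.

```python
# A SMILES tokenization sketch of the kind used by seq2seq reaction models.
import re

SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # A good tokenizer must reconstruct its input exactly:
    assert "".join(tokens) == smiles, "tokenizer lost characters"
    return tokens

# Reaction SMILES convention: reactants>reagents>products
rxn = "CC(=O)O.NCc1ccccc1>>CC(=O)NCc1ccccc1"
print(tokenize(rxn))
```

The resulting token sequence, rather than raw characters, is what the encoder-decoder network translates from reactant side to product side.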

Recalling the point about machines treating chemistry as a language (refer Fig. 3), Schwaller et al. (2018) moved a step ahead by demonstrating computational linguistics for solving chemical predictions. They related organic chemistry to a language and applied template-free seq2seq models. Adopting the model reported by Vaswani et al. (2017) and using the SMILES representation, Schwaller's team developed the Molecular Transformer, which demonstrated higher accuracy in predicting reaction outcomes (Schwaller et al. 2019). Further, it could accurately predict the selectivity, specificity, regioselectivity and chemoselectivity of reactions. Intriguingly, this model is utilized in IBM RXN (refer Table 1). Other efforts of ML in organic synthesis worth mentioning are automation in the chemical sciences (Dragone et al. 2017), ML-based reaction optimization (Gao et al. 2018), and DL-based chemical pattern prediction (Cova and Pais 2019).

Fig. 3

Representation of a typical Diels–Alder reaction between cyclopentadiene and maleic anhydride. a Kekulé-type reaction graph; b a parameter table allows optimization data capture for the chemical reaction; c and d use markup and natural languages respectively, of which the former is of greater significance; e, f and g describe reactions as ReactionSMILES, chemical fingerprints and descriptors, which are easier for machines to understand. CGR means Condensed Graph of Reaction

It is opined that natural product synthesis, organocatalysis and drug discovery are the three fundamental areas of chemistry utilizing state-of-the-art ML techniques. Natural product synthesis and organocatalysis in particular fall under the category of organic chemistry and have witnessed a major transformation in terms of retrosynthesis; hence they are covered in the next section. Considering the expanse of drug discovery and repurposing, it is discussed separately in Sect. 3.

2.1.1 Natural product syntheses

Natural products are complex target molecules whose multiple cyclization reactions make synthetic routes difficult to interpret. Chemists find planning multistep natural product synthesis a challenging endeavour, and integrating computational methods with AI techniques is of great relevance for understanding it. Tantillo (2018) discussed typical questions that can be addressed using computational modelling of natural product synthesis. Marth et al. (2015) reported the natural product synthesis of weisaconitine D and liljestrandinine by network-analysis modelling along with AI-assisted retrosynthesis. Kim et al. (2019) reported the total synthesis of paspaline A and emindole PB using a computational model integrated with AI-assisted retrosynthesis. The Chematica team designed machine-tuned natural product syntheses of (−)-dauricine, (R,R,S)-tacamonidine and lamellodysidine A that were reported to be comparable to those designed by skilled chemists (Klucznik et al. 2020).

2.1.2 Organocatalysis

Asymmetric enantioselective organocatalysis is ranked as one of the emerging chemically sustainable technologies (Gomollón-Bel 2019). The effect of isoxazole additives on the carbon–nitrogen coupling Buchwald–Hartwig reaction was studied using machine learning, specifically a random forest, to predict reaction outcomes (Ahneman et al. 2018). Kondo et al. (2020) demonstrated atom-efficient organocatalyzed enantioselective Rauhut–Currier and [3 + 2] annulation reactions for a chiral spirooxindole analogue in a flow system, applying Gaussian process regression to multi-parameter reaction screening. Determining the transition states of enantioselective reactions is time-consuming and lacks accuracy, bringing ML to the rescue. Gallarati et al. (2021) developed an ML model that predicted the enantioselectivity of Lewis base-catalysed propargylation reactions; further, the model independently predicted the absolute configuration of the enantiomeric excess product. This work is unique, as the enantioselectivity of an organocatalyst is a challenging property for ML models to predict. The authors represented the propargylation reaction to an ML model trained to calculate the activation energies of competing catalytic pathways. This novel strategy of exploiting activation energy differences of organocatalytic products has paved the way for deploying ML algorithms on complex enantioselective catalyst systems. A sketch of such Gaussian process screening follows.
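The sketch below illustrates Gaussian process regression of the kind used in such multi-parameter screening, implemented with scikit-learn; the reaction variables, enantiomeric-excess values and kernel choice are invented assumptions, not data from the studies cited above.

```python
# A hedged sketch of Gaussian process regression for reaction screening:
# two invented conditions (temperature, catalyst loading) -> %ee.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X = np.array([[25, 5], [25, 10], [40, 5], [40, 10], [60, 5]], float)  # °C, mol%
y = np.array([72.0, 80.0, 65.0, 70.0, 48.0])                          # %ee (invented)

gpr = GaussianProcessRegressor(
    kernel=RBF(length_scale=[10.0, 5.0]) + WhiteKernel(1.0),  # anisotropic + noise
    normalize_y=True,
).fit(X, y)

# Predict an untested condition with an uncertainty estimate; the std
# is what guides which experiment to run next in screening campaigns.
mean, std = gpr.predict(np.array([[30, 8]]), return_std=True)
print(f"predicted ee at 30 °C, 8 mol%: {mean[0]:.1f} ± {std[0]:.1f} %")
```

The predictive uncertainty is the key output: conditions where the model is least certain, or most promising, are selected for the next round of experiments.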

3 Facilitating drug discovery and repurposing

The drug discovery process involves identifying new chemical entities as potential therapeutic agents. Now that it is realized that emerging infectious diseases (EIDs) are a part of the human condition, AI-based methods are sought after for their predictive modelling. It is felt that intelligent systems, once in place, should be able to predict emerging diseases prior to their occurrence. ML methods are particularly robust when applied as predictive models in drug discovery and public health. Figure 4 depicts supervised, unsupervised and reinforcement learning for representing drug molecules and understanding their therapeutic potential. Target validation, biomarker identification and computational pathology are the three key areas of drug discovery that have adopted DL methods, particularly for therapeutics in cancer and, most recently, in SARS-CoV-2 disease. Considering the expanse of drug discovery, this section focuses on recent efforts to discover antiviral COVID-19 agents using advanced ML methods, and a brief subsection describes the recent progress in drug repurposing for COVID-19. For a more comprehensive review of drug discovery, readers may refer to the reviews by Dara et al. (2022), Kolluri et al. (2022), Shehab et al. (2022) and Pillai et al. (2022). Though drug discovery is described separately, its understanding is symbiotic with organic synthesis.

Fig. 4

Supervised, unsupervised and reinforcement learning in drug discovery

3.1 Drug discovery for COVID-19

Drug discovery, particularly the stages of drug target identification, compound screening and preclinical studies, offers tremendous scope for applying ML-based methods. If machine learning and deep learning techniques can assist in establishing a causal relationship between a novel target molecule and the disease, drug discovery shall become a cost- and time-efficient endeavour for pharmaceutical industries. In this section, a concise discussion is presented on the progress of drug discovery for antiviral agents against COVID-19.

Amilpur and Bhukya (2022) reported an LSTM model for searching and generating novel molecules that can potentially bind the main 3CLpro protease of the coronavirus. They screened about 2.9 million molecules from the ChEMBL, MOSES and RDKit collections, represented as SMILES, prior to deploying the generative LSTM model. Using binding affinity scores, their model suggested 10 potential drug candidates for treating infections. A state-of-the-art quantum computing ML-based framework was designed as an in silico tool for discovering novel drug candidates against COVID-19 (Mensa et al. 2022). A novel MP-GNN model and featurization were reported for designing COVID-19 drugs (Li et al. 2022). Their model comprised two unique properties, viz. multiscale interactions utilizing more than one type of molecular graph, and simplified feature generation. They validated the MP-GNN model with datasets from PDBbind, and over 185 complexes of SARS-CoV-2 inhibitors were evaluated for binding affinity using this unique model. Drug molecules and chirality have always presented a unique relationship. Exploiting this premise for natural remedies, natural products were screened to find novel drug candidates: Vasighi et al. (2022) proposed an ML-based technique to classify and discover COVID-19 inhibitors obtained from natural products, preparing a docking protocol with 125 ligands and analysing protein–ligand interactions and the drug-likeness properties of inhibitors using statistical exploratory data analyses. The structural characteristics of SARS-CoV-2, especially the spike proteins, have been intensively investigated. It was revealed that cathepsin L (CTSL) increases the severity of COVID-related infections by activating the spike protein of the coronavirus (Zhao et al. 2021). Hence, CTSL became a promising target, and the search for its inhibitors was widely pursued using advanced DL-based techniques and statistical models. Yang et al. (2022a) reported a DNN alongside Chemprop for identifying novel molecules and approved drugs that block CTSL activity. Five molecules, namely daptomycin, MG-132, MG-102, Z-FA-FMK and calpeptin, potentially blocked CTSL activity and alleviated the severity of secondary COVID infections.
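Much of the screening described above includes a drug-likeness triage step before any expensive docking or assay. A minimal sketch with RDKit follows, assuming Lipinski's rule of five and the QED score as filters; the candidate SMILES are illustrative and unrelated to the cited studies.

```python
# A minimal drug-likeness triage sketch using RDKit descriptors and QED.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def passes_ro5(mol):
    """Lipinski's rule of five: MW <= 500, logP <= 5, HBD <= 5, HBA <= 10."""
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Descriptors.NumHDonors(mol) <= 5
            and Descriptors.NumHAcceptors(mol) <= 10)

candidates = [
    "CC(=O)Oc1ccccc1C(=O)O",                 # aspirin
    "CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc12",     # chloroquine
]
for smi in candidates:
    mol = Chem.MolFromSmiles(smi)
    print(f"{smi}  Ro5 pass: {passes_ro5(mol)}  QED = {QED.qed(mol):.2f}")
```

Filters of this kind cheaply remove implausible candidates so that the generative or affinity-prediction models operate on a drug-like subset.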

All these reports indicate that most researchers did not rely solely on vaccines, but rather focused on novel molecules as potential drug candidates for alleviating COVID infections. In spite of public misinformation, vaccines are a safe therapy to combat the disease, albeit one that cannot be relied upon entirely, essentially due to resistance of mutant SARS-CoV-2 and subsequent breakthrough infections.

3.2 Drug repurposing for COVID-19

When the world was hit by the COVID-19 pandemic, there was an urgent need to handle the spread of the coronavirus and its treatment. With no vaccines available at the time, the pandemic forced researchers to innovate and strategize antiviral treatment using AI-based techniques. This urgency also led researchers to find old drugs, utilizing AI-based learning methods, for treating COVID infection. This process of finding existing approved drugs for treating emerging diseases is called drug repurposing. As the SARS-CoV and SARS-CoV-2 viruses display a similar receptor binding mode (Lan et al. 2020), AI-assisted models utilized their structural data to predict drug molecules that could alleviate COVID-19 symptoms. Until the vaccines arrived, these old, marketed drugs were repositioned for treating COVID-19 infected patients (Mohanty et al. 2020). AI-assisted drug repurposing requires an open drug database and a repurposed drug database as input labels, to which various algorithms are then applied; together, these processes generate the drug molecule required for the purpose. The critical issue in drug repurposing is the determination of a unique drug–disease relationship. AI models built on molecular descriptors, functional-class fingerprints (FCFPs), chemical fingerprints, and physico-chemical properties such as partition coefficients could screen and identify drugs for treating coronavirus patients. Drug repurposing for COVID-19 primarily utilized three types of algorithms, viz. network-based (Ge et al. 2021), expression-based (Pham et al. 2021) and integrated docking simulations (Ahmed et al. 2022). Sibilio et al. (2021) examined three different network-based algorithms to identify potential drug molecules using transcriptomic data from the WBCs of COVID-infected patients; their in silico studies predicted drug–disease associations and the disease-likeness of COVID with other diseases. Yang et al. (2022b) demonstrated a novel web server called D3AI-CoV for target identification and drug screening to combat COVID infections, employing advanced DL-based models with canonical SMILES representations, more than 800 bioactives, and 29 targets against nine coronavirus variants. Xie et al. (2022) proposed a compressed sensing algorithm combined with centered kernel alignment that shortlisted a total of 15 drug candidates as therapeutics for COVID-19.
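A minimal sketch of fingerprint-driven repurposing triage follows: candidate drugs are ranked by Tanimoto similarity to a known active, using RDKit's feature-based Morgan fingerprints as an FCFP-like stand-in. The reference active and the small library are illustrative; a real campaign would screen thousands of approved drugs against validated actives.

```python
# Fingerprint-similarity triage sketch: rank a toy drug library against a
# stand-in reference active using FCFP-like fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fcfp4(smiles):
    """FCFP4-like bit vector via RDKit's feature-based Morgan variant."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=2, nBits=2048, useFeatures=True)

reference = fcfp4("CC(C)Cc1ccc(cc1)C(C)C(=O)O")    # ibuprofen as stand-in active
library = {                                        # toy approved-drug library
    "aspirin":  "CC(=O)Oc1ccccc1C(=O)O",
    "naproxen": "COc1ccc2cc(C(C)C(=O)O)ccc2c1",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}
ranked = sorted(((name, DataStructs.TanimotoSimilarity(reference, fcfp4(smi)))
                 for name, smi in library.items()), key=lambda kv: -kv[1])
for name, score in ranked:
    print(f"{name:10s} Tanimoto = {score:.2f}")    # naproxen should rank highest
```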

Most of the reported literature focused on network-based, expression-based and docking simulation algorithms for identifying drug–disease relationships, viral gene expression and host protein–target interactions. It is argued that, even with these reports, DL-based methods remain limited in scope when determining repurposed drugs for potential use in COVID-19 treatment. Most DL-based methods require huge patient datasets that are not publicly available, hindering infection and survival predictions for COVID-19. Hence, most of the reported literature utilized smaller datasets that cannot be extrapolated to public health studies.

Proceeding with the discussion, the review now shifts focus to analytical chemistry, especially chemometrics and ML techniques in spectroscopy, microscopy and chromatography. Tremendous scope for successful chemometrics and ML-based techniques is witnessed in analytical chemistry, and it is envisioned that, with further advances, automated analytical systems will become a reality. The following section describes the current progress of AI-based techniques and automation in spectroscopy, microscopy and chromatography.

4 AI and automation in analytical chemistry

Modern analytical techniques create huge data for heterogeneous samples that need to be interpreted by chemists. Analytical chemists spend most of their time identifying and quantifying molecules in laboratory samples ranging from food and drug molecules to industrially important molecules. The chromatograms and spectra generated undergo chemometric and standard mathematical algorithms to derive useful information, though a huge subset of the data remains ignored. Earlier, library search algorithms were employed to obtain crucial information about molecular structures from spectral data. Today, the situation has matured to the extent that machine learning techniques, such as convolutional neural networks, are applied to spectral peaks, microscopic images and chromatograms.

Prior to data interpretation, it must be recognized that chemical data retrieved from instrumental techniques contain distortions called artefacts. These artefacts are caused by instrument noise, sample type, solvent effects and physico-chemical factors. Their presence in spectral and chromatographic data adversely affects crucial datasets, leading to loss of chemical information. Eliminating or suppressing these distortions to enhance the data is called pre-processing; it involves correcting peak shifts, baseline correction, noise removal, stray light suppression and retrieving missing data values (Chalmers 2006). In the following section, the chemometrics and ML methods employed in spectroscopy, microscopy and chromatography are discussed (refer Fig. 5).
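A minimal pre-processing sketch follows: a synthetic spectrum with a drifting baseline is smoothed with a Savitzky-Golay filter and then baseline-corrected by subtracting a fitted polynomial. The spectrum, noise level and filter parameters are illustrative assumptions.

```python
# Pre-processing sketch: Savitzky-Golay smoothing + crude polynomial
# baseline correction on a synthetic IR-like spectrum.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.linspace(400, 4000, 1800)                     # wavenumber axis (cm^-1)
peaks = (1.0 * np.exp(-(x - 1650) ** 2 / 800)        # two synthetic bands
         + 0.6 * np.exp(-(x - 2900) ** 2 / 2000))
baseline = 1e-4 * (x - 400)                          # drifting baseline artefact
spectrum = peaks + baseline + rng.normal(0, 0.02, x.size)

smoothed = savgol_filter(spectrum, window_length=21, polyorder=3)  # denoise
coeffs = np.polyfit(x, smoothed, deg=1)              # crude linear baseline fit
corrected = smoothed - np.polyval(coeffs, x)         # baseline-corrected signal
print(f"mean residual offset after correction: {corrected.mean():.3f}")
```

Production pipelines use more careful baseline estimators (e.g., fitting only peak-free regions), but the smooth-then-subtract structure is the same.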

Fig. 5

Overview of chemometrics and ML methods applied to analytical techniques. Spectroscopy, chromatography and microscopy are depicted in the left panel (not drawn to scale, not representative of any data). The right panel depicts chemometrics and ML models applied to analytical data after pre-processing. Finally, it depicts the navigation towards automation utilizing IoT, sensory devices, flow chemistry and mobile robots

4.1 Chemometrics and machine learning in spectroscopy

Chemometrics became successful as a statistical technique through its application to near-infrared spectroscopy. Near-infrared spectra contain deeply convolved signals that are not baseline-separated, making it difficult to quantify crucial information about molecules. Vibrational spectroscopy, NMR, MS and hyphenated techniques generate multidimensional spectral data containing a plethora of critical information related to molecular structures. These data are optimized and studied using chemometrics and machine learning methods with enhanced precision and accuracy. A typical chemometric calibration is sketched below.
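As a concrete instance of this workflow, the sketch below calibrates a partial least squares (PLS) model relating NIR-like spectra to an analyte concentration with scikit-learn; all spectra and concentrations are simulated stand-ins.

```python
# PLS calibration sketch: synthetic NIR-like spectra -> analyte concentration.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_samples, n_wavelengths = 60, 200
concentration = rng.uniform(0, 1, n_samples)
# One Gaussian-shaped pure-component spectrum scaled by concentration:
pure_component = np.exp(-((np.arange(n_wavelengths) - 80) / 15.0) ** 2)
X = (np.outer(concentration, pure_component)
     + rng.normal(0, 0.01, (n_samples, n_wavelengths)))   # instrument noise

pls = PLSRegression(n_components=3)                        # latent variables
r2 = cross_val_score(pls, X, concentration, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.3f}")
```

The number of latent variables plays the same role here as it does in classical chemometric calibration: too few underfits the convolved bands, too many fits noise.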

4.1.1 Vibrational spectroscopy

NIR, IR and Raman spectroscopy are typical vibrational spectroscopic methods that derive structural information by measuring the vibrations of molecules. An open-source Python module called "nippy" was employed for NIR spectral data (Torniainen et al. 2020). Roger et al. (2020) reported the utilization of sequential and orthogonal PLS regression for pre-processing NIR spectral data of wheat grain, tablet and meat samples. Martyna et al. (2020) applied a genetic algorithm to Raman spectral data; the algorithm assessed the pre-processing technique by calculating variance ratios and was validated on forensic Raman spectral data. After chemical data enhancement, various chemometric algorithms were applied to obtain critical information from the spectroscopic data. Raman and SERS techniques produce complex vibrational spectra of chemical mixtures that are exploited to obtain such information.

Until recently, linear regression analysis was performed to obtain useful data from Raman and SERS vibrational spectra, but deep learning has been replacing these statistical models. Weng et al. (2020) modelled deep learning with a CNN and PCANet that identified drugs in human urine with an accuracy above 98.05%. They also measured pirimiphos-methyl in wheat extract, quantified using fully convolutional neural networks with a determination coefficient of 0.9997. Table 2 lists selected spectroscopic techniques and different chemometric and ML-based methods with their potential applications; a minimal CNN sketch follows the table.

Table 2 Selected spectroscopic techniques, different chemometric and machine learning methods, and their applications
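A minimal sketch of a 1D convolutional network of the kind used for such spectral classification follows, in PyTorch; the architecture, spectra and labels are illustrative stand-ins, not the published models cited above.

```python
# A 1D-CNN sketch for binary spectral classification (e.g., analyte
# present/absent); input spectra and labels are random placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(                       # input: (batch, 1, 1000) spectra
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
    nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
    nn.Flatten(),
    nn.Linear(32 * 62, 2),                   # 1000 -> 250 -> 62 points per channel
)

spectra = torch.randn(8, 1, 1000)            # batch of 8 synthetic spectra
labels = torch.randint(0, 2, (8,))           # random class labels
loss = nn.CrossEntropyLoss()(model(spectra), labels)
loss.backward()                              # one illustrative training step
print(f"initial loss: {loss.item():.3f}")
```

The convolution kernels learn local band shapes directly from the spectrum, which is what lets such models outperform hand-crafted peak features.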

4.1.2 NMR spectroscopy and mass spectrometry

NMR spectroscopy and mass spectrometry are sophisticated analytical techniques that provide critical information on the type of nuclei and the m/z of chemical molecules, respectively. Deep neural networks in particular gained importance in NMR spectral interpretation, enabling time-efficient data acquisition and lowering chemists' training burden (Chen et al. 2020). Kong et al. (2020) reported deep learning through a CNN coupled with sparse matrix completion to suppress noise and speed up 2D nanoscale NMR spectroscopy. A momentum of interest was witnessed in DNNs for reconstructing non-uniformly sampled NMR with enhanced resolution in shorter times (Hansen 2019; Karunanithy and Hansen 2021). One particular concern has been unravelling critical structural information from the multidimensional NMR spectra obtained in metabolomic studies. Metabolomics generates large data with crowded NMR spectral peaks, and hence peak picking is an old yet hard problem; the conventional peak picking methods of a routine NMR instrument may be insufficient. Specialized DNNs are providing respite to analytical chemists, decoding such spectra via advanced GUI interfaces (Rahimi et al. 2021) and DNNs (Li et al. 2022). Native MS is utilized for unravelling macromolecular structures, particularly nucleic acids and proteins; an intriguing study was reported by Allison et al. (2022) applying native MS, complemented by ML methods, to the structural elucidation of selected protein complexes. A conventional peak-picking baseline is sketched below.
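For contrast with the DNN pickers above, the sketch below performs conventional height-and-prominence peak picking on a synthetic 1D spectrum with SciPy; the peak positions, widths and noise are invented for illustration.

```python
# Conventional peak picking on a synthetic 1D NMR-like spectrum.
import numpy as np
from scipy.signal import find_peaks

x = np.linspace(0, 10, 4000)                      # chemical shift axis (ppm)
spectrum = (np.exp(-(x - 3.4) ** 2 / 0.002)       # two closely spaced peaks
            + 0.8 * np.exp(-(x - 3.5) ** 2 / 0.002)
            + np.random.default_rng(2).normal(0, 0.01, x.size))

# Height and prominence thresholds are the hand-tuned knobs that DNN
# pickers learn implicitly from data.
idx, props = find_peaks(spectrum, height=0.1, prominence=0.05)
print("picked peaks at", np.round(x[idx], 2), "ppm")
```

On crowded metabolomic spectra, overlapping multiplets defeat such fixed thresholds, which is precisely the regime where the learned pickers cited above are reported to help.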

When spectrometric methods are combined with chromatography, they are called hyphenated techniques. Hyphenated techniques such as LC–MS and GC–MS produce multidimensional data that require advanced DL techniques for interpretation. Qiu et al. (2018) reported GC–MS data interpretation without spectral library queries, efficiently prioritizing biological candidate molecules from orthogonal datasets of retention indices, mass spectra and other physicochemical parameters of compounds. Recently, a deep learning algorithm called 'peakonly' was developed by Melnikov et al. (2020) that provides precise peak identification and integration in LC–MS data.

4.2 Chemometrics and machine learning in microscopy and chromatography

Advances in chemometrics and ML methods have led to their use in chemical image processing in electron microscopy, atomic force microscopy and 2D chromatographic techniques. This has provided insights into crucial information about molecular structures, where chemical images are obtained either as grayscale or as hyperspectral images.

In this section, recent advances in ML methods reported for imaging techniques and chromatography are explored.

4.2.1 Atomic force microscopy (AFM)

AFM is an advanced topographic imaging technique that produces highly resolved images at atomic resolution, allowing nanoscale characterization of important materials such as biological and inorganic samples. Previous attempts were made to minimize heuristic probe conditioning during imaging using algorithms (Villarrubia 1997), inverse imaging of the probe (Schull et al. 2011; Welker and Giessibl 2012; Chiutu et al. 2012) and probe manipulation (Paul et al. 2014). However, these methods are not suitable for large dataset acquisition, and AFM imaging presents further challenges such as scan speed, optimization, and artefacts in scanned images. An autonomous AFM utilizing an AI framework was reported by Krull et al. (2020) that allowed probe quality assessment, conditioning and repair, along with large-scale data acquisition. Javazm and Pishkenari (2020) proposed adaptive and multi-layered neural fuzzy inference system NNs to address AFM's restricted scan speeds. Payam et al. (2021) reported AFM data acquisition and imaging using the continuous wavelet transform on photodetector data; their approach generated data rapidly and provided amplitude and phase information for the AFM probe as the sample material varied. The transform step is sketched below.
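The sketch below applies a continuous wavelet transform to a synthetic amplitude-modulated, photodetector-like signal using PyWavelets; the cantilever frequency, modulation depth and scale range are assumptions for illustration, not parameters from the cited work.

```python
# CWT sketch on a synthetic AFM photodetector-like signal: a 50 kHz
# "cantilever" carrier whose amplitude is modulated by the sample.
import numpy as np
import pywt

fs = 1e6                                     # 1 MHz sampling (hypothetical)
t = np.arange(0, 0.005, 1 / fs)
carrier = np.sin(2 * np.pi * 50e3 * t)       # cantilever oscillation
signal = (1 + 0.3 * np.sin(2 * np.pi * 500 * t)) * carrier  # sample modulation

scales = np.arange(10, 60)                   # covers roughly 14-81 kHz for 'morl'
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
amplitude = np.abs(coeffs)                   # time-frequency amplitude map

best = amplitude.max(axis=1).argmax()        # scale with the strongest response
print(f"strongest response near {freqs[best] / 1e3:.0f} kHz")
```

The amplitude (and phase) of the transform, tracked along the scan, is the quantity that maps material contrast in this approach.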

4.2.2 Electron microscopy (EM)

In EM, an electron beam illuminates the sample to generate an image that provides critical information on surface characteristics and detailed morphology. In EM imaging, chemists scan selected regions of the sample and assess image quality based on past experience: if the chemist considers an EM scan to be of poor quality, they change the instrument conditions and rescan another region of the specimen. Thus, most of the endeavour is trial and error, often time-consuming owing to the optimization of the scanned specimen region, the probe type, and the voltage between specimen and probe needed to obtain highly resolved images.

Ilett et al. (2020) reported a validated, automated agglomerate measurement for characterizing the dispersion of nanoparticles in biological fluids using the open-source machine learning software ilastik and CellProfiler. Their approach utilized automated STEM imaging to obtain statistically relevant image data coupled with machine learning analysis, and was further extrapolated to confirm that FeO nanoparticles agglomerate in cell culture medium deficient in surface-stabilising serum proteins. Yu et al. (2020) applied semantic image segmentation to analyse the pore spaces of sandstone and their relationship with permeability characteristics; their deep learning neural networks precisely recognized SEM images, leading to improved identification of pores in sandstone samples. Wang et al. (2021) developed an unsupervised ML algorithm for automated transmission electron microscopic image analysis of metal nanoparticles, exploring the automated algorithm on palladium nanocubes and CdSe/CdS quantum dots with quantitative results. A classical segmentation baseline is sketched below.
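As a classical counterpart to these learned pipelines, the sketch below segments a simulated micrograph with Otsu thresholding and connected-component labelling in scikit-image, then reports per-particle measurements; the image and particle positions are synthetic.

```python
# Classical particle segmentation baseline on a synthetic "micrograph".
import numpy as np
from skimage import draw, filters, measure

image = np.zeros((256, 256))
for r, c, rad in [(60, 60, 12), (150, 170, 18), (200, 80, 9)]:  # fake particles
    rr, cc = draw.disk((r, c), rad)
    image[rr, cc] = 1.0
image += np.random.default_rng(3).normal(0, 0.1, image.shape)   # detector noise

mask = image > filters.threshold_otsu(image)   # global intensity threshold
labels = measure.label(mask)                   # connected-component labelling
for region in measure.regionprops(labels):
    if region.area > 20:                       # ignore noise specks
        print(f"particle: area = {region.area} px, "
              f"eccentricity = {region.eccentricity:.2f}")
```

Thresholding fails on touching agglomerates and uneven illumination, which is the gap the trained segmentation models described above are designed to close.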

4.2.3 Chromatography

Chromatography is a separation technique that involves partitioning the individual compounds of complex mixtures between mobile and stationary phases. It faces the problems of peak overlap and of analysing one type of data at a time. The need to separate multiple samples with complex matrices led to the development of 2D chromatography, which uses two chromatographic columns with different phases. During a separation run, sequential aliquots collected from the first chromatographic column are reinjected onto the second chromatographic column (Jones 2020); components that could not be separated on the first column are thus separated on the second. The resulting data are plotted in 2D or 3D space, generating complex data that are essentially resolved using algorithms (Huygens et al. 2020).

Pérez-Cova et al. (2021) developed the ROIMCR (Region of Interest Multivariate Curve Resolution) method for 2D liquid chromatographic separations. The retention index (RI) is a critical chromatographic parameter that depends on the chemical structure and the type of stationary phase employed during separation, and several efforts have been made to predict retention indices to enhance the identification of analyte molecules. Matyushin and Buryak (2020) utilized four machine learning models, viz. 1D and 2D CNNs, a deep residual multilayer perceptron, and gradient boosting, describing molecules for input as string notations, 2D representations, molecular fingerprints and descriptors. The models were deployed and tested on flavouring agents, essential oils and metabolomic compounds of interest, exhibiting errors of only about 0.8–2.2%. Further, they used free software, demonstrating that their models are easily transferable to the lab bench towards automation. Vrzal et al. (2021) proposed the DeepReI model, based on deep learning, for accurate retention index prediction; it used SMILES notation as input labels and a predictive model of 2D CNN layers with a percentage error of < 0.81%. Qu et al. (2021) described training graph neural networks to predict retention indices for NIST-listed compounds and compared the results with earlier published work, demonstrating that their systematic, data-driven deep learning approach to RI prediction outperforms previous machine learning models. A descriptor-based sketch of this workflow follows.
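A minimal descriptor-based sketch of this workflow follows: molecules are featurized with a handful of RDKit descriptors and a gradient boosting regressor is fitted. The training pairs use the definitional Kovats indices of n-alkanes (100 × carbon count); the tiny descriptor set, the four-sample training set and the model settings are illustrative, far from the published models' scale.

```python
# Descriptor-based retention index (RI) prediction sketch.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import GradientBoostingRegressor

def featurize(smiles):
    """Tiny descriptor vector; real studies use hundreds of descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumRotatableBonds(mol)]

# Kovats RI of an n-alkane is 100 x (carbon count) by definition.
train = [("CCCCCC", 600), ("CCCCCCC", 700),
         ("CCCCCCCCC", 900), ("CCCCCCCCCC", 1000)]
X = np.array([featurize(s) for s, _ in train])
y = np.array([ri for _, ri in train])

model = GradientBoostingRegressor(n_estimators=100).fit(X, y)
octane_ri = model.predict(np.array([featurize("CCCCCCCC")]))[0]
# Tree ensembles interpolate within the training range; expect a value
# between 700 and 900 for octane (definitional value: 800).
print(f"predicted RI for octane: {octane_ri:.0f}")
```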

5 Challenges, opportunities and future perspectives

Organic synthesis, drug discovery and analytical techniques are no longer solely human activities requiring numerous experimental protocols and rounds of reaction optimization. Yet even with a significant uptick of ML methods in chemistry, we face failures in applying them. As uncomfortable as it may sound, there are some serious problems, presented as questions below with subsequent reflections:

  (1) How mature is the status of machine learning and chemometrics in chemistry?

  (2) Are we training and deploying ML models in chemistry in the right manner?

  (3) Can we completely automate our chemical laboratory bench?

It is already known that the utility and application of ML models in chemistry rely heavily on the quality and quantity of data. In most chemical experiments, protocols are based on previously optimized reaction conditions that lack reproducibility (Bergman and Danheiser 2016). Over the years, chemical data reproducibility issues have been addressed, including reaction optimization with minimal information (Reker et al. 2020; Shields et al. 2021). Efforts have been initiated in this direction by employing the FAIR guidelines (Wilkinson et al. 2016) for chemical data; these guidelines are now transpiring as a research consortium among chemists for data sharing practices, fostering a digital chemistry culture (Herres-Pawlis et al. 2019). Next, chemical process optimization, which long remained dormant, is gradually showing progress through flow chemistry methods (Cherkasov et al. 2018). Mateos et al. (2019) reported continuous-flow self-optimization platforms that include intelligent algorithms and monitoring techniques for a chemical reaction. Inspired by the FAIR guidelines, two novel open-source machine learning benchmarking frameworks, Summit (Felton et al. 2021) and Olympus (Häse et al. 2021), were reported for rapid optimization of reaction conditions; a toy closed-loop sketch follows. In the same breath, it is reiterated that the emergence of ML in organic synthesis must not take away the elegance of discussing synthetic routes among chemists.
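The sketch below runs a toy closed-loop optimization with scikit-optimize standing in for frameworks such as Summit or Olympus; the "yield surface" is an invented function replacing a real flow reactor, and the variable bounds are arbitrary.

```python
# Toy closed-loop reaction optimization via Bayesian optimization.
from skopt import gp_minimize

def negative_yield(params):
    """Invented response surface standing in for a real reactor run."""
    temperature, residence_time = params     # °C, minutes
    # hypothetical optimum near (80 °C, 5 min), ~90% yield
    return -(90 - 0.05 * (temperature - 80) ** 2
                - 2.0 * (residence_time - 5) ** 2)

result = gp_minimize(
    negative_yield,
    dimensions=[(40.0, 120.0), (1.0, 10.0)],  # search bounds per variable
    n_calls=20,                                # experiment budget
    random_state=0,
)
print(f"best conditions: {result.x}, predicted yield: {-result.fun:.1f}%")
```

In a real self-optimizing platform, the objective function is replaced by an actual reactor run plus in-line analysis, so the experiment budget (n_calls) is the dominant cost.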

Though ML methods are transforming chemistry, these methods must not be exaggerated as we navigate the Gartner hype cycle of AI (not a cycle, but a curve) (Gartner 2022). When Beker et al. (2022) investigated the application of an ML model to Suzuki–Miyaura reaction optimization, it was quite evident that data acquisition is a problem. Most of the data fed to machines are extracted from published journal papers and patents that are skewed towards high-yielding reactions; hence bias creeps into the data, making the deployment of ML in organic synthesis planning dicey. As we advocate the success of AI in chemistry, we need to obtain reproducible data for high-yielding reactions and standardize low-yielding ones. Utilizing and augmenting both datasets is a better proposition than merely feeding huge datasets of popular organic reactions. The same scenario holds true for drug discovery, where medicinal chemists search for drug molecules in a near-infinite chemical space. Recalling Lipinski's idea of chemical space, medicinal chemists utilize the rule of five (Ro5) to search for drug molecules (Lipinski 2016). DL methods are robust techniques when applied to drug discovery and repurposing, and hold promise in predictive modelling of emerging diseases for potential target identification. Medicinal chemists have a plethora of choices for representing molecules: apart from SMILES and SMARTS notations, Coulomb matrices, bag-of-bonds, fingerprints and deep tensor networks have been successfully implemented to find druggable molecules.

Another concern of experimental chemists is the failure to generate large datasets for "data-hungry" ML models. This has necessitated the application of transfer learning in chemistry, which allows algorithms to extract knowledge from a pre-trained model. Apart from the standard dataset, a pre-trained model with an application task similar to the target set is fed to the machine model to enhance performance. A few reports have trialled transfer learning in chemical science (Tran et al. 2017; Wen et al. 2022). However, transfer learning is not fool-proof if the chemist chooses a pre-trained model dataset that has a low similarity to the target set. A minimal sketch of the idea follows.
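A minimal PyTorch sketch of the transfer-learning idea follows: the feature extractor of a stand-in "pre-trained" network is frozen and only a fresh output head is retrained on the smaller target task. All module shapes here are hypothetical, and a real model would of course carry weights learned on a large source dataset.

```python
# Transfer-learning sketch: freeze the feature extractor, retrain the head.
import torch.nn as nn

pretrained = nn.Sequential(            # stand-in for a network pre-trained on
    nn.Linear(2048, 256), nn.ReLU(),   # a large source dataset of fingerprints
    nn.Linear(256, 1),                 # original property-prediction head
)

for p in pretrained[0].parameters():   # freeze the feature-extractor layer
    p.requires_grad = False

pretrained[2] = nn.Linear(256, 1)      # fresh head for the small target task

trainable = [p for p in pretrained.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters remain")
```

Only the head's parameters are updated on the target data, which is exactly why the pre-training set must resemble the target set: frozen features that encode the wrong chemistry cannot be corrected downstream.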

The final question rests on the premise that chemists are data generators whereas computer scientists are programming experts. We are convinced that machines are good with images; hence their application to spectral, chromatographic and microscopic data is less problematic. Images are broken down into pixels, each assigned a numeric value, which is fairly easy, yet the images generated carry artefacts. Artefacts are resolved with chemometric pre-processing methods prior to deploying ML models to extract crucial information. As AI-based models are good at deriving critical information from large high-quality datasets, it is possible to deploy them in atomic force microscopy, chromatography and spectroscopy, as discussed in Sect. 4. These sophisticated analytical methods have large datasets available for training that are easily accessible to chemists. Machine learning models are navigating easily across the different areas of analytical techniques, though they cannot be fully automated, as analytical instrument hardware is designed to be operated by humans.

Just as we wonder whether AI is a dream for chemists, some path-breaking reports on mobile robots (Peplow 2014; Burger et al. 2020; Fakhruldeen et al. 2022) on the chemical lab bench bring the hype back. A chemical reaction robotic system controlled by a machine-learning algorithm explored over 6000 organic reactions faster than a synthetic chemist's laboratory processes (Granda et al. 2018). All the efforts discussed so far signal the digitization of chemical laboratories. However, automation in chemistry is not new; in fact, the earliest attempt at chemical automation was demonstrated by Merrifield (1965) with solid phase peptide synthesis, which finds applications in biochemical laboratories to this day. We are not far from an automated lab bench with sensory devices, IoT, digital twins and robust hardware in place. Yet the scaling-up of robotic workflows from the lab to the industrial bench needs practical augmentation. We are in a triad of hope, disillusion and productivity when it comes to reflecting on AI in drug discovery, organic synthesis and analytical methods, respectively (Fig. 6).

Fig. 6

Triad of hope, disillusion and productivity in drug discovery, organic synthesis and analytical chemistry, respectively

Digging further, most of the published literature lacks author diversity that goes beyond gender. A few laboratories work in silos on AI applications in chemistry, which is plausible considering data privacy issues and funding constraints. Chemists, engineers, mathematicians and data scientists need to be in dialogue and solve the challenging problems of chemistry collaboratively. Automated robots in chemical laboratories are a daunting task for scientists, especially those from middle-income nations where grant funding is a problem. It is argued that AI-based applications must go beyond borders and contribute fruitfully to the research community. One example is DREAM Challenges, a competition for solving challenging problems in biomedicine, which elicits the need for more such platforms; similar competitions in chemistry would stir up discussions and lead to complex problems being solved through diverse collaborations. Another perspective is to introduce ML into the chemistry curriculum with a focus on solving chemistry problems. There are separate programs for machine learning and artificial intelligence, yet these courses are curated for engineers rather than chemists. ML in chemistry curricula shall inspire young chemists to design their own machine algorithms to solve chemistry problems, without taking away the collaborative spirit of interdisciplinary AI research.

With the renaissance of Industry 4.0, chemometrics and machine learning have much left to explore in providing solutions to chemical problems, and we are not far from advanced ML-based solutions to the challenge. It is well understood that AI essentially derives its power from learning from data, in this case chemical data. If the flaws of chemical data acquisition are resolved, ML methods shall function as more than an auxiliary checkbox and navigate to explore the intricate chemical world.