1 Introduction

In recent years, the proliferation of data across industries has brought both benefits and difficulties. Modern data generation exceeds conventional analysis techniques in volume, velocity, and diversity, making it increasingly difficult for individuals to extract useful information on their own. An effective remedy for this issue is machine learning (ML), a branch of artificial intelligence that provides automated techniques for extracting beneficial knowledge from enormous quantities of data [1].

ML algorithms have shown significant potential in fields such as healthcare, finance, e-commerce, transportation, and autonomous systems. In healthcare, ML models can accurately evaluate medical images to identify diseases such as cancer, aid in diagnosing illnesses, forecast patient outcomes, and suggest individualized treatment strategies. In finance, ML algorithms are used for fraud detection, algorithmic trading, risk assessment, and client segmentation to enhance efficiency and profitability [2, 3]. In autonomous systems, such as self-driving vehicles and drones, ML allows for immediate decision-making using sensor data, resulting in safer and more dependable performance.

The optimization procedure plays an important role in the success of ML models. Optimization is the process of repeatedly modifying the parameters of a model to minimize or maximize a preset objective function, such as a loss function in supervised learning [4]. The objective of optimization is to identify the parameters that most accurately fit the provided data and perform well on novel data, thereby enhancing predictive accuracy and performance.
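To make this iterative process concrete, the following minimal sketch (illustrative only; the one-parameter linear model and synthetic data are assumptions, not drawn from any cited work) minimizes a mean-squared-error loss by repeatedly stepping a parameter opposite the gradient:

```python
import numpy as np

# Toy data for a one-parameter model y = w * x.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w, lr = 0.0, 0.1                           # initial parameter and learning rate
for step in range(100):
    grad = np.mean(2 * (w * x - y) * x)    # gradient of the MSE loss w.r.t. w
    w -= lr * grad                         # step opposite the gradient
print(round(w, 3))                         # converges near the true slope 3.0
```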

Optimization approaches play an important role in training ML models to obtain high accuracy, efficiency, and generalization ability. Optimization algorithms allow models to learn from data, adapt to complicated patterns, and generate reliable predictions by modifying model parameters. The selection of an optimization method can greatly impact the efficiency and convergence of an ML model, making it an essential decision throughout the model development process [5].

Gradient descent is a basic optimization procedure used in many ML methods, including neural networks. Gradient descent variants such as Stochastic Gradient Descent (SGD), mini-batch gradient descent, and the Adam optimizer are effective at optimizing large-scale models with millions of parameters. Bayesian optimization is a widely used approach that probabilistically searches the parameter space to identify the best hyperparameter configuration, hence improving model performance and generalization [6, 7].

Regularization methods like L1 and L2 have become essential for reducing overfitting and enhancing the generalization ability of ML models. Regularization, achieved by including a penalty term in the loss function, encourages the model to learn simpler representations [8]. This minimizes the probability of memorizing irrelevant details in the training data and enhances the model's performance on novel, unseen data.

Hyperparameter optimization techniques such as grid search, random search, and Bayesian optimization complement optimization algorithms by determining the most suitable hyperparameters, further improving ML model performance. Hyperparameters such as learning rate, batch size, and regularization strength have a substantial influence on the behaviour and effectiveness of models, necessitating meticulous adjustment for best results [9].
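As an illustration of how such tuning can be automated, the sketch below runs a simple random search over a hypothetical hyperparameter space; `validation_error` is a made-up stand-in for training a model and returning its validation loss:

```python
import numpy as np

rng = np.random.default_rng(1)

def validation_error(lr, batch_size, l2):
    # Hypothetical stand-in: in practice, train a model with these
    # hyperparameters and return its validation loss.
    return (np.log10(lr) + 2) ** 2 + 0.01 * abs(batch_size - 64) + 10 * l2

best = None
for _ in range(50):                              # 50 random trials
    trial = {"lr": 10 ** rng.uniform(-5, -1),    # log-uniform learning rate
             "batch_size": int(rng.choice([16, 32, 64, 128])),
             "l2": 10 ** rng.uniform(-6, -2)}    # regularization strength
    err = validation_error(**trial)
    if best is None or err < best[0]:
        best = (err, trial)
print(best)                                      # lowest validation error found
```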

Optimization methods are essential in ML since they allow models to learn from data, adjust to intricate patterns, and provide precise predictions in different fields. The increasing amount and complexity of data necessitate the creation of efficient and scalable optimization algorithms, which are crucial for advancing ML and exploring new potential for innovation and discovery [10].

An in-depth explanation of optimization approaches applied to a wide range of ML problems across several domains is outlined in Table 1. Each case study provides a detailed examination of a real-world problem, including the dataset, optimization method, and achieved performance metrics. The table presents valuable insights into the effectiveness and flexibility of optimization approaches in addressing intricate issues, with a focus on real-world applications in fields such as healthcare, finance, transportation, and environmental research. Through these case studies, practitioners and researchers may understand how optimization approaches can be tailored to particular domains, enabling informed problem-solving strategies and decision-making processes in real-world scenarios.

Table 1 Case studies of optimization in real-world applications

1.1 Purpose and Structure of the Paper

This study aims to provide an extensive overview of ML optimization approaches, categorize them according to their goals and features, recognize typical obstacles, and suggest potential research paths in the domain. The presentation will start by examining current literature on optimization methods in ML, emphasizing important advancements and patterns. It will then provide a structured categorization of optimization methods, covering different algorithms, approaches, and methodologies. The paper will explore the problems of ML optimization, including complications in the optimization landscape and scalability concerns. It will then explore prospective directions for future study, such as creating new optimization algorithms, tackling scaling issues, and improving the interpretability of optimization outcomes. Using a systematic method, the study seeks to provide significant insights into the current state of ML optimization and direct future research in this quickly changing field.

2 Background and Literature Review

The development of optimization strategies in ML originated in the early stages of artificial intelligence research. These approaches have advanced considerably over time in response to the challenges of learning from data and developing intelligent systems. Here is an in-depth elaboration of the historical milestones:

Frank Rosenblatt's perceptron algorithm is among the earliest examples of optimization methods in ML. The perceptron, introduced in the late 1950s, established the groundwork for learning systems based on neural networks. It used a basic form of gradient descent to modify the weights of connections between neurons in order to reduce classification errors [11].

Paul Werbos introduced backpropagation in the 1970s, a major improvement in optimization methods for neural networks. The approach facilitated the effective training of multi-layer neural networks by propagating errors backward and updating weights via gradient descent. Backpropagation transformed neural network research and set the stage for the revival of deep learning several decades later [12].

The introduction of support vector machines (SVMs) in the 1990s marked a significant change in ML optimization [13]. SVMs demonstrated the efficacy of convex optimization methods in classification applications by offering a strong framework for maximizing the margin between classes. SVMs became popular due to their capacity to manage high-dimensional data and generalize well to novel instances.

Convex optimization approaches became popular in ML in the early 2000s because of their theoretical guarantees and computational efficiency. Stephen Boyd and Lieven Vandenberghe made substantial contributions to the advancement of convex optimization techniques and their utilization in ML [14]. Convex optimization has established a strong mathematical basis for several ML models, such as logistic regression, linear SVMs, and robust PCA.

The 2010s saw a renewed interest in neural networks and the emergence of deep learning. Advancements in optimization algorithms such as SGD, Adam, and RMSprop were essential in facilitating the training of deep neural networks with many layers. Geoffrey Hinton, Yoshua Bengio, and Yann LeCun were important figures in showcasing the effectiveness of deep learning in computer vision, natural language processing, and speech recognition [15].

In recent years, there has been a trend towards investigating new optimization strategies that expand beyond conventional gradient-based approaches. Evolutionary algorithms, metaheuristic optimization, and Bayesian optimization are effective methods for solving non-convex optimization problems and navigating intricate search spaces. Furthermore, progress in technology, such as the development of specialized accelerators like GPUs and TPUs, has hastened the training of extensive ML models.

The present state of research on optimization strategies in ML includes a diverse collection of studies that cover algorithmic advancements, theoretical understandings, and real-world implementations. Many works by well-known authors have greatly progressed the area, illuminating different elements of optimization. Here is an in-depth analysis of the literature and the significant contributions made by prominent authors.

The authors of [16] are prominent figures who have made significant advancements in optimization approaches within ML. They developed important algorithms such as SGD and Adam, which are now essential for training neural networks. Their work established the foundation for understanding the significance of optimization in deep learning and its influence on model effectiveness.

The work in [17] has significantly advanced our knowledge of optimization issues in training deep neural networks. That research has provided insights into problems such as vanishing and exploding gradients, which pose major obstacles when optimizing deep neural networks, and has led to the development of methods such as batch normalization and skip connections that tackle these challenges and enhance the robustness of training.

Recent research has explored the convergence of optimization with other disciplines, including reinforcement learning, meta-learning, and federated learning. Researchers such as [18] have studied optimization methods for reinforcement learning algorithms to enhance sampling efficiency and convergence characteristics. Researchers such as [19] have investigated optimization techniques for meta-learning, which entails acquiring the ability to learn across several tasks and domains. Federated learning, developed by researchers such as [20], aims to train models collectively across decentralized devices while safeguarding data privacy, leading to distinctive optimization difficulties and possibilities.

The literature on optimization strategies includes both theoretical studies and practical implementations in many fields. The authors of [21] have advanced the theoretical underpinnings of convex optimization by establishing precise mathematical frameworks to elucidate optimization techniques and their characteristics. Researchers such as [22] have applied optimization techniques in computer vision, natural language processing, and robotics to effectively solve complex tasks, showcasing the efficacy of optimization-driven approaches.

It is crucial to identify common issues and trends in optimization strategies in ML to comprehend the existing situation and direct future research efforts. Here is a detailed examination of common obstacles and upcoming patterns:

One of the main obstacles in ML optimization is handling non-convex objective functions. Several practical optimization problems have non-convex surfaces, which may result in suboptimal solutions and convergence challenges. Researchers such as [23] have tackled these obstacles by creating adaptive optimization algorithms such as Adam, which modifies learning rates and integrates momentum to better navigate non-convex landscapes. Non-convex optimization continues to be a focus of study, with efforts directed towards developing more resilient algorithms that can effectively deal with intricate optimization surfaces.

Scalability is a significant problem for optimization algorithms, especially in the presence of massive data and intricate model designs. As datasets expand and models become more intricate, conventional optimization methods may face challenges in meeting computing requirements. The authors of [24] have investigated distributed training and model parallelism strategies to address scalability issues efficiently. These methods spread computations over numerous devices or machines to facilitate effective training on large datasets and intricate models. Scalability remains a challenge, particularly with the emergence of new paradigms such as federated learning and edge computing.

Efforts to develop more effective and resilient optimization algorithms are a prominent focus in ML research. Conventional optimization methods can suffer from slow convergence, sensitivity to initial conditions, and vulnerability to adversarial attacks. Researchers are investigating innovative methods including evolutionary algorithms, metaheuristic optimization, and quantum-inspired optimization to tackle these difficulties [25]. These methods use concepts from biology, nature, and quantum physics to create optimization algorithms that are more effective and resilient. Collaborative research combining optimization, computer science, and physics is advancing innovation in this field, resulting in advanced optimization methods with enhanced performance and dependability. The optimization methods utilized in ML are comprehensively summarized in Table 2. The table highlights the vast array of research initiatives devoted to advancing the state of the art in optimization algorithms and overcoming common challenges. In addition to optimizing training for particular neural network and ML model types, the objectives address scalability concerns, non-convex optimization challenges, and optimization challenges in deep learning, reinforcement learning, and natural language processing. The proposed solutions include the development of adaptive optimization algorithms such as Adam and SGD, as well as distributed training, model parallelism, regularization techniques, and specialized optimization strategies tailored to particular domains and model architectures.

Optimization strategies in ML have evolved over time through the pursuit of effective, scalable, and reliable algorithms. Optimization has played a crucial role in shaping the discipline, from early perceptrons to the deep learning revolution. The literature encompasses a wide range of works, from fundamental methods such as SGD to sophisticated implementations in reinforcement and federated learning [26]. Interdisciplinary techniques are necessary to address difficulties such as non-convexity and scalability, and collaboration will be essential for developing optimization methods and their practical applications as the field progresses.

Table 2 Authors’ contributions to advancing optimization in machine learning

3 Survey of Machine Learning Optimization Techniques

ML optimization methods are essential for adjusting model parameters to achieve optimal results in tasks including classification, regression, clustering, and reinforcement learning. This study provides an in-depth review of the optimization strategies used in ML, covering their objectives, algorithms, and perspectives.

3.1 Classification of Optimization Techniques Based on Optimization Objectives

ML optimization strategies are often classified according to the specific objectives they aim to improve. These goals usually consist of accuracy, speed, and generalization performance, all of which are essential for the overall efficiency of ML models.

3.1.1 Accuracy-Oriented Optimization Techniques

Accuracy-oriented optimization techniques such as SGD, Adam, AdaGrad, RMSprop, Nesterov Accelerated Gradient (NAG), and Adadelta are crucial for refining ML models to achieve superior prediction accuracy and reduce error measures. These strategies concentrate on improving model parameters via the adjustment of learning rates, the use of momentum, and the efficient exploration of optimization landscapes. The Adam and RMSprop algorithms are proficient at optimizing complex, non-convex objective functions in deep learning by adjusting learning rates dynamically and considering past gradients [44, 45]. NAG improves standard gradient descent by adding momentum, which accelerates convergence and enhances performance. Adadelta adjusts learning rates automatically by considering gradient magnitude and update history, removing the need for manual tuning. These accuracy-focused optimization strategies allow practitioners to train models effectively and achieve higher performance in many ML applications.

3.1.2 Speed-Oriented Optimization Techniques

Speed-oriented optimization strategies are essential for training ML models efficiently, especially on large-scale datasets and computationally demanding tasks. These strategies streamline the training process to decrease computing time and resource requirements, resulting in faster convergence and enhanced efficiency.

1) Mini-Batch Gradient Descent: Mini-batch gradient descent is an adaptation of the gradient descent optimization process that splits the training data into smaller batches. It estimates gradients from a subset of data samples, in contrast to batch gradient descent, which uses the complete dataset, and SGD, which uses a single data point [46]. Mini-batch gradient descent allows for more frequent updates and faster convergence by adjusting model parameters using gradients calculated from each mini-batch. Practitioners may take advantage of both stochastic and batch gradient descent by using this technique, achieving a compromise between computational efficiency and convergence speed (several of the speed-oriented techniques in this list are combined in the sketch that follows it).

2) Parallelization and Distributed Training: Parallelization approaches divide computing tasks across numerous processing units such as GPUs, TPUs, or distributed clusters. They expedite training and enhance computational efficiency by handling several data points concurrently. Distributed training parallelizes the process by distributing model parameters and data over numerous devices or processors [47]. This enables simultaneous calculation of gradients, parameter updates, and model evaluations, resulting in substantial reductions in training duration and resource needs. Frameworks such as TensorFlow, PyTorch, and Horovod provide strong support for distributed training on different hardware setups, enabling practitioners to use parallelization and distributed computing methods to accelerate the training process.

3) Momentum Optimization: Momentum optimization improves conventional gradient descent by incorporating momentum into the parameter updating process. It accumulates a moving average of prior gradients to guide parameter changes, rather than relying solely on the current gradient. Momentum optimization may expedite convergence, particularly when dealing with noisy or sparse gradients [48]. The momentum term reduces oscillations and aids in navigating steep optimization landscapes more efficiently, resulting in quicker convergence throughout the optimization process.

4) Adaptive Learning Rate Scheduling: Adaptive learning rate scheduling approaches modify the learning rate during training based on factors including gradient magnitudes, parameter updates, and convergence progress. Examples of learning rate schedules include learning rate decay, which decreases the learning rate over time, and learning rate warm-up, which increases the learning rate at the beginning of training [49]. Adaptive learning rate scheduling approaches optimize the training process by adjusting the learning rate to the optimization landscape, leading to quicker convergence and enhanced stability.

5) Model Quantization and Compression: Model quantization and compression approaches seek to decrease the computational and memory requirements of ML models by expressing parameters and activations with a reduced number of bits. Quantizing model parameters and activations to lower precision, such as 8-bit integers, may greatly decrease memory bandwidth needs and speed up inference on hardware with restricted computing capabilities [50]. Methods such as quantization-aware training and post-training quantization allow users to deploy models at lower precision without significant loss in performance, making them ideal for speed-focused applications with limited resources.

6) Early Stopping and Model Checkpointing: Early stopping and model checkpointing are methods used to monitor the training process and terminate training when certain convergence criteria are reached or performance indicators plateau. Early stopping strategies enable practitioners to halt training early by assessing the model's performance on a validation set at intervals, thereby averting overfitting and decreasing the consumption of training time and computing resources [51]. Model checkpointing complements early stopping by intermittently storing copies of the model parameters during training. These checkpoints enable the continuation of training from intermediate stages, recovery from failures, and deployment of trained models for inference. They help optimize the use of computing resources and support reproducibility and scalability in ML processes.
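The following sketch (a toy linear-regression setup, assumed for illustration rather than drawn from the surveyed works) combines several of the speed-oriented techniques above: mini-batch updates, momentum, learning-rate decay, early stopping with checkpointing, and a simple post-training int8 quantization helper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=1000)
X_tr, y_tr, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

w, v = np.zeros(10), np.zeros(10)            # parameters and momentum buffer
lr0, beta, batch = 0.05, 0.9, 32
best_val, best_w, patience, bad = np.inf, w.copy(), 5, 0

for epoch in range(100):
    lr = lr0 / (1 + 0.05 * epoch)            # learning-rate decay schedule
    idx = rng.permutation(len(X_tr))
    for start in range(0, len(X_tr), batch): # mini-batch updates
        b = idx[start:start + batch]
        grad = 2 * X_tr[b].T @ (X_tr[b] @ w - y_tr[b]) / len(b)
        v = beta * v - lr * grad             # momentum accumulation
        w += v
    val = np.mean((X_val @ w - y_val) ** 2)  # validation loss per epoch
    if val < best_val - 1e-6:
        best_val, best_w, bad = val, w.copy(), 0   # checkpoint the best weights
    else:
        bad += 1
        if bad >= patience:                  # early stopping
            break

def quantize_int8(t):
    """Toy post-training quantization: map floats to int8 plus a scale."""
    scale = max(np.abs(t).max() / 127, 1e-12)
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale                          # dequantize with q * scale

q_w, scale = quantize_int8(best_w)
```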

Speed-focused optimization methods such as mini-batch gradient descent, momentum optimization, adaptive learning rate scheduling, model quantization and compression, and early stopping with model checkpointing are crucial for speeding up the training process and cutting down on computational costs in ML tasks. By using these methods, practitioners may make the most of computing resources, speed up model convergence, and scale their ML processes to handle bigger datasets and more intricate models efficiently.

3.1.3 Generalization-Oriented Optimization Techniques

Generalization-oriented optimization strategies are crucial for ensuring that ML models generalize well to novel data and avoid overfitting. These methods adjust model parameters to improve the model's capacity to generalize, leading to more resilient and dependable performance on unseen instances.

1) Regularization: Regularization methods such as L1 and L2 help prevent overfitting by penalizing large parameter values [52, 53]. By including regularization terms in the objective function, these strategies encourage the learning of simpler models that generalize more effectively to novel data (a short sketch after this list illustrates an L2 penalty alongside dropout).

• L1 Regularization (Lasso): L1 regularization adds a penalty term proportional to the absolute values of the model's parameters to the loss function. This penalty promotes sparsity in the parameter space, which decreases the model's complexity and removes unnecessary features.

• L2 Regularization (Ridge): L2 regularization adds a penalty term proportional to the squared magnitude of the model's parameters to the loss function. This penalty promotes smaller parameter values and smoother decision boundaries, which helps prevent overfitting and enhances generalization performance.

2) Dropout: Dropout is a regularization method used in neural networks to inhibit the co-adaptation of neurons and enhance generalization performance. Dropout randomly deactivates a portion of the neurons in each layer during training, which adds noise and redundancy to the network [54]. This randomness encourages the network to acquire more resilient features and avoids over-reliance on particular neurons or features, leading to better generalization to novel inputs.

3) Data Augmentation: Data augmentation artificially increases the variety of the training dataset by applying various modifications to the input data, such as rotations, translations, flips, and zooms [55]. This process exposes the model to a wider array of data variations, promoting the learning of more invariant and robust features. Augmentation can be especially advantageous in situations with little or unbalanced training data, allowing the model to generalize more effectively to novel instances and enhancing its overall performance.

4) Early Stopping: Early stopping is a regularization method that prevents overfitting by evaluating the model's performance on a separate validation set during training and interrupting the training process when performance begins to deteriorate. Early stopping ends training before the model memorizes noise or outliers in the data, ensuring better generalization to new samples [56]. This strategy encourages the model to acquire more generalizable characteristics and keeps it from becoming overly complex, thus enhancing its capacity to generalize to new data.

5) Ensemble Learning: Ensemble learning approaches combine multiple base models to create a more resilient and precise composite model. Ensemble approaches aggregate predictions from numerous models to enhance generalization performance by reducing the variance and bias of individual models [57]. Bagging, boosting, and stacking techniques use a variety of base models trained on distinct portions of the training data or with various methods to capture complementary characteristics of the underlying data distribution. Ensemble learning efficiently utilizes the combined knowledge of numerous models to provide more reliable predictions and improve generalization to new instances.

6) Transfer Learning: Transfer learning is an ML method that utilizes information acquired from a source domain or task to enhance performance on a target domain or task. Transfer learning allows models to generalize better to new data and tasks with limited annotated training data by transferring representations learned by models pre-trained on large-scale datasets [58]. It is beneficial when the target task lacks adequate training data or when the source and target domains share fundamental properties. Transfer learning improves generalization performance and speeds up model convergence by using pre-trained models as feature extractors or by fine-tuning their parameters for the target task, making it a valuable tool for generalization-oriented optimization.
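As a minimal illustration of two of the techniques above (toy shapes assumed; not code from the surveyed works), the sketch below adds an L2 penalty to a mean-squared-error loss and implements inverted dropout; an L1 penalty would instead add `lam * np.sum(np.abs(w))` to the loss and `lam * np.sign(w)` to the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalized_loss_grad(w, X, y, lam):
    """MSE loss with an L2 (ridge) penalty lam * ||w||^2 and its gradient."""
    err = X @ w - y
    loss = np.mean(err ** 2) + lam * np.sum(w ** 2)
    grad = 2 * X.T @ err / len(y) + 2 * lam * w   # penalty shrinks weights
    return loss, grad

def dropout(h, p_drop, training=True):
    """Inverted dropout: zero units with prob p_drop, rescale survivors."""
    if not training:
        return h                       # no-op at inference time
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)   # keep expected activation unchanged
```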

Generalization-oriented optimization approaches including data augmentation, early stopping, ensemble learning, and transfer learning are crucial for enhancing the resilience and generalization capabilities of ML models. By integrating these methods throughout the training process, professionals may efficiently combat overfitting, promote the acquisition of more accurate features, and enhance the model’s capacity to generalize to novel data and tasks.

3.2 Popular Optimization Algorithms

Optimization algorithms are essential in training ML models, iteratively updating model parameters to minimize or maximize an objective function. This section presents an overview of popular optimization algorithms in ML, each with distinct mechanisms and trade-offs for achieving effective model optimization.

3.2.1 Gradient Descent and Its Variants

Gradient descent is an important optimization approach that reduces objective functions by repeatedly modifying model parameters in the direction opposite to the gradient [59]. Different versions of gradient descent provide various approaches for adjusting model parameters and are designed to tackle distinct obstacles faced during optimization. Here, we explore several notable variations:

1) Stochastic Gradient Descent (SGD): SGD updates model parameters using gradients calculated from randomly chosen subsets of training instances, known as mini-batches [60]. SGD introduces randomness into the optimization process, resulting in quicker convergence and lower memory use compared to batch gradient descent. By updating parameters based on mini-batches, SGD rapidly navigates complicated optimization landscapes and handles large-scale datasets. SGD's convergence may vary more than that of batch gradient descent because of the stochastic nature of gradient estimation.

2) Adam (Adaptive Moment Estimation): Adam is an adaptive learning rate optimization technique that integrates principles from momentum and RMSprop [61]. It adapts learning rates for individual parameters by considering previous gradients and squared gradients, which is effective for optimizing intricate, non-convex objective functions often seen in deep learning applications. Adam can efficiently manage sparse gradients, noisy data, and non-stationary targets by adjusting learning rates according to gradient magnitudes and variance. Adam's adaptive learning rate approach allows for quicker convergence and improved performance across many optimization problems (a minimal implementation, including the AMSGrad variant, appears after this list).

3) RMSprop (Root Mean Square Propagation): RMSprop is an optimization approach that tackles the issue of decreasing learning rates in AdaGrad by using a moving average of squared gradients to dynamically adjust learning rates for each parameter [62]. RMSprop stabilizes the optimization process by smoothing learning rates and limiting their rapid decline. By using a leaky average of past squared gradients, RMSprop effectively manages non-stationary objectives and improves optimization stability and efficiency compared to AdaGrad. RMSprop is well suited to training deep neural networks and other models with intricate, high-dimensional parameter spaces.

4) Adagrad (Adaptive Gradient Algorithm): Adagrad is an optimization technique that adapts the learning rates of individual parameters based on accumulated past gradients. It assigns higher learning rates to parameters that are updated infrequently and lower learning rates to parameters that are updated regularly. Adagrad is especially useful for sparse data or features with varying scales, since it adapts the learning rates according to the gradient magnitudes [63]. However, Adagrad's long-term accumulation of squared gradients can cause learning rates to shrink, leading to slower convergence during later stages of training.

5) Adadelta: Adadelta is a modification of Adagrad designed to resolve the problem of decreasing learning rates. Adadelta computes a moving average of squared gradients over a sliding window of prior gradients instead of treating all historical gradients equally [64]. By adjusting learning rates based on recent gradients, Adadelta addresses the issue of declining learning rates. Adadelta is well suited to optimization problems that require reduced sensitivity to the learning rate, such as training RNNs.

6) Nadam (Nesterov-accelerated Adaptive Moment Estimation): Nadam is an extension of Adam that integrates Nesterov momentum into its update rule. Nesterov momentum calculates the gradient at a location slightly ahead in the direction of momentum, rather than at the current parameter values [65]. By using Nesterov momentum to anticipate parameter movements and adjust update directions, Nadam achieves quicker convergence and enhanced performance, particularly when dealing with noisy gradients.

7) AdaMax: AdaMax is a variation of the Adam optimization algorithm that uses the infinity norm in its update rule, resulting in improved convergence stability under certain circumstances [66]. AdaMax replaces Adam's exponential moving average of squared gradients with an exponentially weighted infinity norm of past gradients. This makes learning rate adaptation less dependent on gradient magnitude, enhancing AdaMax's resilience to noisy gradients and sparse data.

8) AMSGrad: AMSGrad is a modification of the Adam optimization method designed to resolve the non-convergence that can occur with the original Adam algorithm in certain scenarios [67]. AMSGrad adjusts Adam's update rule so that the denominator term (built from squared gradients) is monotonically non-decreasing, preventing the effective learning rate from growing. By maintaining the maximum of past squared-gradient averages, AMSGrad improves stability and convergence compared to Adam, especially for non-convex optimization objectives.
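For concreteness, here is a minimal NumPy sketch of a single Adam update with a flag for the AMSGrad correction discussed above (a hedged illustration of the published update rules, not library code):

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, amsgrad=False):
    """One Adam parameter update; set amsgrad=True for the AMSGrad variant."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    if amsgrad:
        # Keep the running maximum so the effective step size never grows.
        state["v_max"] = np.maximum(state.get("v_max", 0.0), v_hat)
        v_hat = state["v_max"]
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), state

# Usage: state = {"t": 0, "m": 0.0, "v": 0.0}; w, state = adam_step(w, g, state)
```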

Gradient descent and its variants, summarized in Fig. 1, provide a range of optimization methods designed to tackle various obstacles in ML applications. By adjusting learning rates, adding momentum, and addressing challenges such as decreasing learning rates and erratic gradients, these variants enhance the optimization of model parameters and expedite the attainment of optimal solutions.

Fig. 1 Gradient descent and its variants in ML optimization techniques

3.2.2 Evolutionary Algorithms

Evolutionary algorithms are population-based optimization methods inspired by the principles of natural selection and evolution. These algorithms operate on a group of candidate solutions (individuals) and gradually improve them over several generations to discover optimal or near-optimal solutions. Some representative evolutionary algorithms are shown in Fig. 2 and detailed below.

Fig. 2 Evolutionary algorithms of ML optimization techniques

1) Genetic Algorithms (GA): Genetic Algorithms (GA) imitate natural selection by progressively evolving a group of potential solutions via selection, crossover, and mutation operations. The procedure starts by creating a population of individuals that represent possible solutions to the optimization problem [68, 69]. Each generation involves selecting individuals based on their fitness, i.e., their capacity to solve the problem, and then subjecting them to genetic operations such as crossover and mutation to create offspring. The offspring replace less fit individuals in the population, resulting in the progressive development of improved solutions over subsequent generations. Genetic algorithms use selection, crossover, and mutation to navigate the solution space, directing the search towards favorable areas and eventually reaching optimal or near-optimal solutions (a minimal sketch appears after this list).

2) Evolutionary Strategies (ES): Evolutionary Strategies (ES) are a class of evolutionary algorithms that concentrate on optimizing continuous parameters by using a Gaussian distribution to create offspring solutions. ES operates on a group of potential solutions, each represented by a vector of continuous variables. Each generation produces offspring by perturbing the parameters of parent solutions according to a predetermined Gaussian distribution [70, 71]. Offspring are assessed according to their fitness, and those with superior fitness are chosen to form the next generation. Evolutionary strategies iteratively adjust candidate solutions' parameters according to their performance, exploring the solution space and converging towards optimal or near-optimal solutions.

3) Genetic Programming (GP): Genetic Programming (GP) applies evolutionary algorithms to develop computer programs or models represented as trees. Genetic programming initializes populations of program structures, with each program serving as a possible solution to the optimization problem [72, 73]. Genetic operators such as mutation, crossover, and reproduction evolve these program structures over many generations. Subtrees from parent programs are swapped during crossover to generate offspring programs that may have different architectures. Mutation introduces random alterations in particular programs, whereas reproduction passes certain programs on unmodified to the following generation. Each program's fitness is assessed according to its problem-solving capability, and selection processes decide which programs are kept for generating offspring in the next generation. By applying genetic operators and selection iteratively, genetic programming evolves program structures towards solutions that maximize the specified objective.

4) Differential Evolution (DE): Differential Evolution (DE) is an optimization approach that improves candidate solutions by modifying their parameter vectors via differential mutation and recombination operations within a population-based framework. DE maintains a population of candidate solutions, each characterized by a vector of parameter values. In each iteration, DE creates trial solutions by combining vectors from the current population via differential mutation and recombination [74, 75]. The trial solutions are evaluated against the current population, and individuals with better fitness are selected for the following generation. DE's differential mutation technique promotes effective exploration of the solution space, while its recombination operations support the exploitation of promising areas, resulting in rapid convergence to optimal or near-optimal solutions.

5) Particle Swarm Optimization (PSO): Particle Swarm Optimization (PSO) is an optimization technique based on the collective behaviour of bird flocks or fish schools. The PSO algorithm maintains a population of particles, each representing a potential solution to the optimization problem. In each iteration, particles move through the solution space based on their individual best-known position and the best-known position of the overall population [76, 77]. The movement is directed by velocity vectors that are adjusted according to each particle's past behaviour and the combined influence of nearby particles. PSO's exploration-exploitation mechanism allows effective investigation of the solution space while exploiting promising locations discovered by the swarm. Through repeated adjustments of particle positions and velocities, PSO converges towards optimal or near-optimal solutions for various optimization problems.

6) Cultural Algorithms (CA): Cultural Algorithms (CA) are evolutionary algorithms that incorporate ideas from cultural evolution into the optimization process. CA maintains a population of candidate solutions and a belief space that represents cultural knowledge or norms [78, 79]. Each iteration involves evaluating prospective solutions according to their fitness and updating cultural knowledge based on the population's performance. Cultural information influences the development of potential solutions by biasing selection, crossover, and mutation operations towards areas of the solution space that align with existing norms. By integrating genetic evolution with cultural development, CA efficiently reaches optimal or near-optimal solutions while maintaining and spreading valuable information within the population.

7) Covariance Matrix Adaptation Evolution Strategy (CMA-ES): Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is an evolution strategy that optimizes continuous parameters by sampling from a multivariate normal distribution. CMA-ES maintains a group of candidate solutions and adapts the covariance matrix of the multivariate normal distribution to direct the search. By adapting the covariance matrix according to the success of prior candidate solutions, CMA-ES efficiently explores and exploits the solution space, resulting in rapid convergence to optimal or near-optimal solutions [80, 81]. CMA-ES is well suited to optimization problems involving high-dimensional parameter spaces and intricate, non-convex objective functions.

8) Memetic Algorithms (MA): Memetic Algorithms (MA) combine evolutionary algorithms with local search techniques to enhance solution quality. MA maintains a population of potential solutions that are subjected to genetic operations such as mutation, crossover, and selection [82, 83]. In addition, each candidate solution undergoes local search to explore its surrounding area and improve its quality. By combining global exploration with local exploitation, MA can efficiently manage the balance between exploration and exploitation, resulting in enhanced convergence and solution quality.

9) Estimation of Distribution Algorithms (EDA): Estimation of Distribution Algorithms (EDA) are a type of evolutionary algorithm that uses statistical techniques to represent the probability distribution of promising candidate solutions. EDA builds and refines a probabilistic model of the solution space from the observed candidate solutions. The probabilistic model is used to generate fresh candidate solutions, which are assessed and added to the population [84, 85]. By explicitly modeling the probability distribution of the solution space, EDA can effectively direct the search towards favorable areas, resulting in quicker convergence and enhanced solution quality.

10) Hybrid Evolutionary Algorithms: Hybrid Evolutionary Algorithms combine evolutionary optimization approaches with other optimization methods to exploit their complementary strengths. Hybrid algorithms combine evolutionary algorithms with gradient-based optimization methods, local search algorithms, or metaheuristic approaches to enhance convergence speed and solution quality [86]. By combining evolutionary algorithms with other optimization methodologies, hybrid techniques can successfully solve complicated optimization problems with diverse features and constraints.

11) Self-Adaptive Evolutionary Algorithms: Self-Adaptive Evolutionary Algorithms modify their parameters and operations in response to the problem's characteristics and the algorithm's performance. These algorithms automatically adjust their mutation rates, crossover probabilities, population sizes, or other parameters as they run, tuning themselves to particular problem instances [87, 88]. By adjusting to changing circumstances and needs, self-adaptive evolutionary algorithms can deliver resilient and efficient optimization across many problem domains and situations.

12) Multi-Objective Evolutionary Algorithms (MOEA): Multi-Objective Evolutionary Algorithms (MOEA) are specifically designed to optimize several competing objectives simultaneously. MOEAs maintain populations of candidate solutions, each embodying a trade-off between the objectives [89, 90]. The algorithms aim to produce a set of solutions called the Pareto front, which represents the best possible compromises between competing objectives. MOEAs use specialized selection processes and diversity-preservation tactics to effectively explore the Pareto front and find high-quality solutions that satisfy several optimization requirements.
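The sketch below shows a minimal genetic algorithm on a toy continuous objective (the fitness function, population size, and operator rates are illustrative assumptions, not taken from the surveyed works):

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):                         # toy objective: maximize -||x||^2
    return -np.sum(x ** 2)

pop = rng.normal(size=(40, 5))          # population of real-valued genomes
for gen in range(100):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:20]]   # selection: fittest half
    children = []
    for _ in range(20):
        a, b = parents[rng.integers(20)], parents[rng.integers(20)]
        cut = rng.integers(1, 5)
        child = np.concatenate([a[:cut], b[cut:]])          # one-point crossover
        child += rng.normal(scale=0.1, size=5) * (rng.random(5) < 0.2)  # mutation
        children.append(child)
    pop = np.vstack([parents, children])
best = pop[np.argmax([fitness(ind) for ind in pop])]  # near the optimum at 0
```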

Evolutionary algorithms comprise several optimization approaches inspired by natural selection and evolution. These algorithms manage groups of candidate solutions and gradually improve them using evolutionary processes, social interactions, or adaptive methods to discover optimal or near-optimal answers to intricate optimization problems. By harnessing the combined knowledge of populations and adjusting to different conditions, evolutionary algorithms provide strong and efficient methods for tackling a variety of optimization problems in different fields.

3.2.3 Bayesian Optimization

Bayesian optimization is an approach for optimizing black-box objective functions whose evaluations are expensive, using a probabilistic framework [91, 92]. Bayesian optimization typically models the objective function as a Gaussian process, providing a flexible and probabilistic representation of uncertainty without needing explicit gradients or derivatives of the objective function, unlike classic optimization approaches.

1) Key Components of Bayesian Optimization:

• Surrogate Model: Bayesian optimization builds a predictive probabilistic model of the objective function from the evaluations observed so far [93]. The surrogate model, typically a Gaussian process, captures the underlying patterns and uncertainties in the objective function space.

• Acquisition Function: The acquisition function guides the search for optimal solutions by managing the trade-off between exploration and exploitation. The usefulness of a candidate point is determined by considering both the surrogate model's predictions and its uncertainty estimates. Common acquisition functions include Expected Improvement (EI), Probability of Improvement (PI), and Upper Confidence Bound (UCB) [94].

• Sequential Sampling: Bayesian optimization iteratively selects candidate points suggested by the acquisition function, evaluates the objective function at these points, and updates the surrogate model with the observed data [95]. The sequential sampling approach effectively navigates the objective function space and concentrates the search on favourable areas (a minimal end-to-end sketch appears at the end of this subsection).

2) Advantages of Bayesian Optimization:

• Efficient Exploration: Bayesian optimization effectively navigates the objective function space by iteratively updating a probabilistic surrogate model and concentrating the search on areas with high potential for improvement [96]. This allows for efficient global search even in complex or noisy optimization settings.

  • Sample Efficiency: Bayesian optimization uses probabilistic models and acquisition functions to efficiently optimize by needing fewer objective function evaluations than standard approaches. This is especially beneficial in situations when objective function evaluations are resource-intensive or time-consuming [97].

• Adaptive Trade-off: Bayesian optimization uses the acquisition function to balance exploration and exploitation, enabling it to shift between exploring new parts of the objective function space and exploiting already promising locations. This adaptive behavior allows for robust and effective optimization across many problem domains [98].

3) Applications of Bayesian Optimization:

• Hyperparameter Tuning: Bayesian optimization has become popular for optimizing hyperparameters in ML models, with the objective function reflecting model performance indicators such as accuracy or validation loss. Bayesian optimization efficiently searches the hyperparameter space to find configurations that optimize model performance while minimizing computing resources [99].

• Experimental Design: Bayesian optimization is utilized in experimental design to optimize parameters and conditions, such as drug dosing in pharmaceutical research or experimental setup in scientific trials. Bayesian optimization speeds up the discovery of optimal solutions by iteratively choosing trials that maximize expected information gain [100].

  • Automated Machine Learning (AutoML): Bayesian optimization is a fundamental element of AutoML systems, streamlining the tasks of model selection, feature engineering, and hyperparameter optimization. AutoML frameworks effectively explore the range of potential models and settings using Bayesian optimization to find high-performing ML processes [101].

Bayesian optimization provides a systematic and effective method for optimizing black-box objective functions whose evaluations are costly. Through probabilistic surrogate models, adaptive acquisition functions, and sequential sampling strategies, Bayesian optimization efficiently explores and exploits the objective function space, making it a powerful tool for optimizing ML models and other complex systems.
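To tie the components together, here is a small end-to-end sketch of the loop described above, using a Gaussian-process surrogate from scikit-learn and the Expected Improvement acquisition on a one-dimensional toy objective (the objective, search range, and kernel choice are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                        # stand-in for an expensive black box
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(4, 1))       # a few initial evaluations
y = objective(X).ravel()
grid = np.linspace(0, 5, 500).reshape(-1, 1)

for _ in range(20):                      # sequential sampling loop
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                         # surrogate model of the objective
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.min()                       # we minimize the objective
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # Expected Improvement
    x_next = grid[np.argmax(ei)]         # acquisition-function maximizer
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))  # evaluate and update the data
print(X[np.argmin(y)], y.min())          # best point found
```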

3.3 Optimization Techniques for Various Machine Learning Technologies

Optimization strategies in ML are applied across supervised learning, unsupervised learning, semi-supervised learning, transfer learning, metaheuristic optimization, and reinforcement learning methodologies.

3.3.1 Supervised Learning

Supervised learning optimization strategies focus on minimizing a predetermined loss function that measures the difference between predicted and true labels. The goal is to develop a model that can precisely map input features to their respective labels. Gradient-based optimization methods such as SGD, Adam, and RMSprop are often used to train supervised learning models [102]. These methods update model parameters repeatedly by calculating gradients of the loss function with respect to the parameters and adjusting them to minimize the loss.

3.3.2 Unsupervised Learning

Unsupervised learning tasks, including clustering and dimensionality reduction, involve optimizing objective functions based on data structure and similarity measurements, without the use of labeled data. Clustering methods such as k-means and hierarchical clustering divide data into groups (clusters) in which data points in the same cluster are more alike than those in other clusters [103]. Principal component analysis (PCA) is a dimensionality reduction approach that aims to decrease the number of variables in the data while retaining a significant portion of its variability. Optimization strategies often use iterative algorithms to update cluster assignments or alter data representations in order to optimize a given criterion, such as reducing intra-cluster variance or increasing between-cluster separation.
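For illustration, here is a minimal k-means implementation of the alternating assignment and update steps just described (a toy NumPy sketch; initialization and convergence handling are simplified assumptions):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]        # random init
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each center to the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                        # converged
            break
        centers = new
    return centers, labels   # locally minimizes intra-cluster variance
```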

3.3.3 Reinforcement Learning

Reinforcement learning utilizes optimization methods to discover an optimal policy that maximizes cumulative rewards in a changing environment. Reinforcement learning algorithms use different optimization techniques to adjust the policy parameters using rewards and states observed during exploration and exploitation phases. Q-learning is a widely used reinforcement learning technique that repeatedly updates the Q-values of state-action pairs to estimate the optimal action-value function [104]. Policy gradient approaches, such as REINFORCE, optimize policy parameters by calculating gradients of expected rewards with respect to the policy parameters and adjusting them via gradient ascent.
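A minimal tabular Q-learning sketch follows; `env_step` is a hypothetical environment callable (state, action) -> (next_state, reward, done), assumed for illustration:

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False              # assume state 0 is the start state
        while not done:
            # Epsilon-greedy: explore with prob eps, otherwise act greedily.
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = env_step(s, a)
            # Move Q(s, a) toward the bootstrapped target.
            target = r + gamma * (0.0 if done else Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```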

3.3.4 Semi-Supervised Learning

Semi-supervised learning is an ML approach that utilizes both labeled and unlabeled data to enhance model performance [105]. This method is beneficial in situations where acquiring labeled data is costly or time-intensive but unlabeled data is abundant. Here is a comprehensive review of the optimization components of semi-supervised learning.

1) Joint Minimization of Loss Function: Semi-supervised learning optimizes by simultaneously reducing the loss function over labeled and unlabeled data. The loss function usually has two parts: one for labeled data with ground-truth labels and another for unlabeled data, where model predictions are compared with pseudo-labels or inferred labels [106]. The goal is to determine model parameters that reduce the total loss over both labeled and unlabeled instances (a minimal sketch of such a joint loss appears after this list).

2) Incorporation of Constraints or Regularization Terms: Semi-supervised learning approaches enhance the use of unlabeled data by integrating additional constraints or regularization terms into the optimization. These constraints encourage smoothness or consistency in predictions for similar instances, exploit cluster structure or manifold geometry in the data, and ensure agreement between labeled and unlabeled cases [107]. Regularization methods including entropy minimization, consistency regularization, and manifold regularization are often used to encourage the model to generate more resilient and generalizable predictions.

3) Utilization of Semi-Supervised Learning Techniques: Semi-supervised learning approaches can efficiently exploit the intrinsic structure or correlations present in the data. Self-training trains a model on labeled data and then uses the model's predictions to assign pseudo-labels to unlabeled instances in an iterative process [108]. Co-training trains several models on distinct subsets of features or representations and shares reliable predictions between them. Graph-based regularization methods use graph structures derived from the data to propagate information and impose smoothness or consistency constraints on the predictions.

4) Combination of Supervised and Unsupervised Optimization Techniques: Optimization methods in semi-supervised learning combine supervised and unsupervised strategies to use labeled and unlabeled data efficiently. Gradient-based optimization methods such as SGD and its variants are often used to train semi-supervised learning models [109, 110]. These techniques optimize model parameters by calculating gradients of the joint loss function with respect to the parameters and repeatedly updating them to minimize the total loss across labeled and unlabeled data.
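A minimal sketch of the joint loss from point 1, combining a supervised cross-entropy term with a pseudo-label term on confident unlabeled predictions (array shapes, the confidence threshold, and the weighting `lam` are illustrative assumptions):

```python
import numpy as np

def joint_loss(probs_lab, y_lab, probs_unlab, lam=0.5, threshold=0.9):
    """probs_* are model output probabilities; y_lab are ground-truth labels."""
    rows = np.arange(len(y_lab))
    sup = -np.mean(np.log(probs_lab[rows, y_lab] + 1e-12))  # labeled term
    conf = probs_unlab.max(axis=1)          # confidence of each pseudo-label
    keep = conf >= threshold                # use only confident predictions
    unsup = -np.mean(np.log(conf[keep] + 1e-12)) if keep.any() else 0.0
    return sup + lam * unsup                # joint labeled + unlabeled loss
```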

Optimization in semi-supervised learning involves reducing the loss function over both labeled and unlabeled data simultaneously, while integrating constraints or regularization terms to make efficient use of the structure of the unlabeled data. Methods such as self-training, co-training, and graph-based regularization leverage the intrinsic structure or connections in the data to enhance model performance. By combining supervised and unsupervised optimization approaches, semi-supervised learning improves model generalization and performance through more efficient use of the available data.

3.3.5 Transfer Learning

Transfer learning is an ML technique that uses information gained on a source task or domain to improve performance on a related but distinct task or domain [111]. Transfer learning optimization involves adjusting pre-trained models or representations for new tasks or domains while reducing the difference between the source and target data distributions. Here is an in-depth analysis of the optimization components of transfer learning:

1) Adaptation of Pre-trained Models: Transfer learning usually starts with a pre-trained model that has been trained on a source task or domain using a substantial quantity of labeled data. The optimization phase entails adapting the pre-trained model to the target task or domain, especially when labeled data is scarce. The objective is to fine-tune the model parameters to achieve high performance on the target task by effectively using the information stored in the pre-trained model [112].

2) Fine-tuning: Fine-tuning is a prevalent method in transfer learning for adjusting pre-trained models to new tasks or domains. The pre-trained model's parameters are adjusted by gradient-based optimization methods such as SGD or its variants. The optimization procedure calculates gradients of the target task's loss function with respect to the model parameters and updates them gradually to reduce the loss. Fine-tuning enables the model to modify its representations and acquire task-specific characteristics while preserving important information from the source task [113].

3) Feature Extraction: Another transfer learning method is feature extraction, in which the pre-trained model's parameters remain unchanged and only the top layers (the classifier) are replaced or retrained for the new task. Optimization in this scenario entails training the classifier on features derived from the pre-trained model while keeping the lower layers fixed. Feature extraction is beneficial when the source and target tasks share comparable low-level properties but differ in higher-level representations or output spaces [114] (a minimal sketch of this freezing strategy appears after this list).

4) Domain Adaptation: Domain adaptation strategies are used in transfer learning when the source and target domains have dissimilar distributions. Domain adaptation optimization minimizes the distribution difference between the source and target domains to adapt the model to the target domain. Methods such as adversarial training, domain adversarial neural networks (DANN), and domain confusion losses can be used to align feature distributions across domains during optimization [115].

  5) Partial Retraining: In some transfer learning situations, it can be advantageous to selectively retrain certain components of the pre-trained model while leaving the rest unmodified. Partial retraining enhances adaptability to the target task or domain by concentrating optimization on the most pertinent components of the model. In CNNs, for example, only the final classification layers may be retrained for a new classification task while the convolutional layers remain frozen to preserve low-level feature representations [116].

Optimization in transfer learning involves adapting pre-trained models or representations to new tasks or domains while reducing the difference between the source and target data distributions. Methods such as fine-tuning, feature extraction, domain adaptation, and partial retraining enable effective optimization of model parameters for the target task or domain. Transfer learning allows for more effective and efficient learning with limited labeled data by transferring knowledge from pre-trained models and adjusting representations for new tasks.
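As a sketch of the trade-off between fine-tuning and feature extraction discussed above, the following Python fragment adapts a pre-trained CNN to a new task. It assumes a recent torchvision (for the pre-trained ResNet-18); num_classes and the learning rates are illustrative placeholders:

    # Hedged sketch: feature extraction vs. fine-tuning with a pre-trained CNN.
    import torch.nn as nn
    import torch.optim as optim
    from torchvision import models

    num_classes = 10  # hypothetical target task

    # Start from a model pre-trained on the source domain (ImageNet).
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Feature extraction: freeze the pre-trained layers ...
    for param in model.parameters():
        param.requires_grad = False

    # ... and replace the classifier head for the target task; the new
    # layer is created trainable, so only it receives gradient updates.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    optimizer = optim.SGD(
        [p for p in model.parameters() if p.requires_grad],
        lr=1e-3, momentum=0.9,
    )

    # Fine-tuning instead: unfreeze everything and use a smaller learning
    # rate so updates adapt, rather than overwrite, the pre-trained features.
    for param in model.parameters():
        param.requires_grad = True
    optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

Feature extraction is the cheaper choice when labeled target data is scarce; fine-tuning generally wins once enough target data is available to adjust all parameters safely.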

3.3.6 Metaheuristic Optimization

Metaheuristic optimization methods are a family of optimization algorithms that draw inspiration from natural and social phenomena [117, 118]. These algorithms provide powerful tools for addressing complex optimization problems across many fields. Here is an in-depth assessment of metaheuristic optimization:

  1) Characteristics of Metaheuristic Optimization:

  • Inspiration from Natural and Social Phenomena: Metaheuristic optimization draws on natural and social phenomena such as evolutionary processes, swarm behaviour, and physical dynamics. Metaheuristic algorithms strive to explore the solution space effectively and identify near-optimal solutions by imitating the processes found in these systems [119].

  • Heuristic Search Strategies: Metaheuristic algorithms use heuristic search methods to efficiently explore intricate solution spaces. These techniques guide the search by investigating promising regions, exploiting solutions already found, and maintaining a balance between exploration and exploitation to avoid premature convergence to poor solutions [120].

  • Iterative Improvement: Metaheuristic optimization techniques usually consist of iterative procedures that progressively enhance candidate solutions over successive generations or iterations. Through repeated cycles, these algorithms refine candidate solutions, adjust their search strategies, and move towards near-optimal solutions [121].

  • Adaptation and Flexibility: Metaheuristic algorithms demonstrate adaptability and flexibility by dynamically adjusting parameters, strategies, and search operators based on problem characteristics, solution quality, or convergence progress. This flexibility improves their robustness and efficiency across diverse optimization tasks [122].

  2) Examples of Metaheuristic Optimization Algorithms:

  • Firefly Algorithm (FA): The Firefly Algorithm is inspired by the bioluminescent flashing of fireflies, which they employ to attract mates or prey. In FA, candidate solutions are represented as fireflies distributed throughout the solution space [123]. Fireflies are attracted to brighter ones, which symbolize superior solutions, and navigate the solution space according to the attractiveness of nearby fireflies. FA optimizes solutions by repeatedly adjusting the positions of fireflies according to their attractiveness and the distance between them, so that fireflies gather toward brighter sources, mimicking the synchronized flashing seen in nature (a minimal sketch follows this list).

  • Harmony Search (HS): Harmony Search is inspired by musicians improvising melodies in search of harmony, generating new melodies by blending components of existing ones. In HS, candidate solutions are represented as musical harmonies. The algorithm creates new solutions by merging components of current harmonies and modifies them to improve the harmony [124]. HS seeks the optimal solution by continuously improving candidate harmonies according to their harmony values, navigating the solution space by modifying musical components and striving towards harmonious solutions.

  • Bat Algorithm (BA): The Bat Algorithm is based on the echolocation behaviour of bats, which use ultrasonic pulses for navigation and prey detection. In BA, candidate solutions are represented as bats that emit sonar pulses within the solution space [125]. Bats use echolocation to explore the solution space and adjust their positions according to the strength of echoes, which indicates solution quality. BA seeks the optimal solution by continuously adjusting the positions of bats according to this echolocation feedback; bats are drawn to areas with stronger echoes, mimicking their natural foraging behaviour.

  • Cuckoo Search (CS): Cuckoo Search is based on the brood parasitism of some cuckoo species, which deposit their eggs in the nests of other bird species. In CS, candidate solutions are represented as cuckoo eggs placed in the nests of host birds. Cuckoos deposit eggs in host nests according to their quality and replace existing eggs when the new ones are superior [126]. CS seeks the optimal solution by continuously updating the cuckoo eggs in host nests; cuckoos converge on optimal solutions by depositing eggs in nests holding superior solutions, mimicking the brood parasitism observed in nature.

  • Artificial Bee Colony (ABC) Algorithm: The ABC Algorithm is based on the foraging behaviour of honeybees, which share information about food sources via waggle dances. In ABC, candidate solutions are represented as food sources, and artificial bees navigate the solution space by inspecting these sources. Bees share details about food quality and select better food sources based on this feedback [127]. ABC seeks the optimal solution by continuously adjusting the locations of food sources; bees navigate the solution space by locating sources with higher nectar levels, mimicking the foraging behaviour of honeybees.
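As a concrete illustration of the pattern these algorithms share (a population of candidates, an attraction rule toward better solutions, and a random exploration term), here is a minimal NumPy sketch of the Firefly Algorithm on a toy objective. The parameter values (beta0, gamma, alpha) are common illustrative defaults, not taken from a specific reference:

    # Hedged sketch: Firefly Algorithm minimizing a toy objective.
    import numpy as np

    def sphere(x):
        return float(np.sum(x ** 2))  # global minimum at the origin

    def firefly(objective, dim=2, n_fireflies=20, n_iters=100,
                beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
        rng = np.random.default_rng(seed)
        pos = rng.uniform(-5, 5, size=(n_fireflies, dim))
        brightness = np.array([objective(p) for p in pos])  # lower = brighter

        for _ in range(n_iters):
            for i in range(n_fireflies):
                for j in range(n_fireflies):
                    if brightness[j] < brightness[i]:  # j is brighter than i
                        r2 = np.sum((pos[i] - pos[j]) ** 2)
                        # Attractiveness decays with the squared distance.
                        beta = beta0 * np.exp(-gamma * r2)
                        # Move i toward j plus a random exploration step.
                        pos[i] += beta * (pos[j] - pos[i]) \
                                  + alpha * (rng.random(dim) - 0.5)
                        brightness[i] = objective(pos[i])
            alpha *= 0.97  # shift gradually from exploration to exploitation
        best = int(np.argmin(brightness))
        return pos[best], brightness[best]

    best_x, best_f = firefly(sphere)  # best_x should approach the origin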

  3) Applications of Metaheuristic Optimization:

  • Engineering Design: Metaheuristic optimization methods are widely utilized in engineering design tasks, including structural optimization, aerodynamic design, and parameter tuning in engineering systems [128].

  • Scheduling and Logistics: Metaheuristic optimization techniques are used to solve scheduling and logistics problems such as task scheduling, vehicle routing, and resource allocation [129].

  • Machine Learning: Metaheuristic optimization techniques are used in a range of ML applications, including hyperparameter optimization, feature selection, and neural network design optimization [130].

  • Combinatorial Optimization: Metaheuristic optimization is widely used to solve combinatorial optimization problems, including those found in graph theory, network optimization, and combinatorial game theory [131].

Metaheuristic optimization techniques provide flexible and effective methods for addressing intricate optimization problems across many fields. By using heuristic search strategies inspired by natural and social phenomena, these approaches enable effective exploration of solution spaces and the discovery of near-optimal solutions in cases where standard optimization methods may struggle [132]. Optimization strategies are essential across a wide range of ML approaches, including supervised learning, unsupervised learning, reinforcement learning, semi-supervised learning, transfer learning, and meta-learning. These methods help models efficiently acquire knowledge from data, adapt to different tasks or domains, and generalize to unfamiliar contexts. Practitioners can choose and adapt optimization approaches that align with their ML tasks and goals by understanding the distinct aims and difficulties of each paradigm.

Table 3 Comparison of optimization algorithms

A comprehensive overview of the optimization techniques widely used in ML is presented in Table 3. The table delineates essential attributes including convergence speed, memory demands, scalability, and applicability to diverse problem domains. It evaluates the merits and drawbacks of various optimization algorithms, ranging from conventional techniques such as SGD to metaheuristic algorithms like genetic algorithms and PSO [143]. These comparisons help practitioners and researchers determine which optimization approach is most suitable for their particular application domains. Qualities such as interpretability and robustness are also taken into account, offering significant perspectives on the practical implications of implementing each optimization method. Overall, the table serves as a valuable reference for understanding and contrasting the heterogeneous landscape of ML optimization algorithms.

4 Challenges in Machine Learning Optimization

ML optimization faces several challenges that affect the efficiency, efficacy, and robustness of learning algorithms, as shown in Fig. 3. These challenges include the following:

  1) Optimization Landscape: The optimization landscape is intricate, featuring non-convexity, multimodality, and irregular geometry [144]. Non-convex objective functions make it hard to locate global optima, while multimodal functions contain numerous local optima. Complex geometries further hinder optimization through abrupt peaks, valleys, and discontinuities, making it difficult for optimization algorithms to reach correct solutions.

  2) Scalability Issues: ML tasks involve managing extensive datasets and complex parameter spaces, which can result in scalability issues. Handling large datasets may place a heavy load on processing resources and memory [145]. Moreover, exploring high-dimensional parameter spaces increases the computational workload of optimization algorithms, since they must search across a very large space to locate suitable solutions.

  3) Optimization Convergence: Optimization algorithms may face convergence problems such as premature convergence, local optima, and saddle points. Premature convergence occurs when the algorithm stops before finding the best solution because it has not fully explored the search space [146]. Local optima are points where the objective function reaches a minimum that may not be the global minimum. Saddle points cause gradients to vanish or diminish, impeding progress towards the best solution (a small illustration follows this list).

  4) Robustness and Generalization: Ensuring robustness and generalization is essential in ML optimization. Overfitting happens when the model absorbs noise or irrelevant patterns from the training data, resulting in worse performance on novel data [147]. Underfitting happens when the model is too simple to capture the fundamental patterns in the data. Model selection involves identifying a suitable model structure and hyperparameters to achieve optimal performance while avoiding overfitting or underfitting.

  5) Computational Complexity: Optimization techniques encounter time and resource limits due to their computational complexity, especially in large-scale or real-time applications. Iterative optimization techniques may demand significant computational resources and time to reach optimal solutions, limiting their practical use in settings with strict time constraints or restricted computational resources [148].

  6) Data Efficiency: Optimization approaches require substantial quantities of data to learn model parameters efficiently, which poses difficulties when data is limited or costly to obtain. Data efficiency refers to the capacity of optimization algorithms to learn well from a small number of data samples while maintaining high performance. Transfer learning, meta-learning, and active learning enhance data efficiency by reusing knowledge from related tasks, adapting to new tasks with less data, and selectively acquiring valuable data points during training [149].

  7) Heterogeneity and Noise: Real-world datasets can be varied and noisy, comprising diverse types, sources, and quality levels. Such heterogeneity complicates the design of optimization algorithms, which must cope with differing data types and distributions. Data noise may conceal genuine patterns and correlations, resulting in subpar model performance. Robust optimization techniques, robust loss functions, and data preprocessing approaches are crucial for reducing the influence of heterogeneity and noise on optimization outcomes [150].

  8) Interpretability and Explainability: As ML models are increasingly used in important fields like healthcare and finance, the interpretability and explainability of optimization results become crucial. Modern optimization methods may produce accurate models whose decision-making process lacks transparency [151]. Interpretable optimization algorithms and model frameworks enable stakeholders to understand and trust model predictions, which aids model deployment and decision-making in real-world settings.

  9) Ethical and Fair Optimization: Optimization methods need to account for ethical considerations and fairness in ML applications. Biases in data or optimization algorithms may result in unjust treatment of individuals or communities and can perpetuate social imbalances. Fair optimization strategies strive to reduce biases and achieve equitable results across demographic groups, emphasizing justice, transparency, and accountability in ML systems [152].

  10) Dynamic Environments: Optimization strategies need to adapt and evolve in settings where data distributions, objectives, or constraints vary over time [153]. Continual learning, online optimization, and adaptive algorithms allow models to remain effective and relevant in dynamic environments by learning from and adjusting to changing conditions.
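To make the convergence challenge in item 3 concrete, the following NumPy sketch shows plain gradient descent stalling in whichever basin of a multimodal function it starts in, with random restarts as one simple mitigation. The objective, step size, and restart count are illustrative choices only:

    # Hedged sketch: local optima on a multimodal function and random restarts.
    import numpy as np

    def f(x):
        return x ** 2 + 10 * np.sin(x)   # several local minima

    def grad_f(x):
        return 2 * x + 10 * np.cos(x)    # analytic gradient

    def gradient_descent(x0, lr=0.01, n_steps=500):
        x = x0
        for _ in range(n_steps):
            x -= lr * grad_f(x)          # converges to the nearest basin
        return x

    rng = np.random.default_rng(0)
    starts = rng.uniform(-10, 10, size=20)

    # Each run ends in a local minimum determined by its starting point;
    # keeping the best of many restarts approximates the global minimum.
    candidates = [gradient_descent(x0) for x0 in starts]
    best = min(candidates, key=f)
    print(f"best x = {best:.4f}, f(x) = {f(best):.4f}")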

Fig. 3 Various challenges in machine learning optimization

Table 4 Challenges in machine learning optimization and associated optimization methods

A methodical overview of the primary challenges in ML optimization is provided in Table 4, along with the optimization techniques aimed at addressing them. The table offers a structured framework for understanding the main obstacles encountered in ML model optimization and the strategies used to overcome them. Its objective is to help readers grasp the wide range of difficulties that come with improving ML algorithms, as well as the optimization techniques that may be applied to each. Researchers, practitioners, and enthusiasts in ML and optimization may benefit greatly from this classification of issues and associated techniques.

5 Future Research Directions

Future prospects for ML optimization research span a broad spectrum of areas aimed at resolving present challenges and expanding the potential of optimization methods. Below are some key recommendations:

Table 5 Future research directions in machine learning optimization

An overview of emerging directions for ML optimization research is presented in Table 5. It identifies important areas where further study is needed in order to overcome open problems, capitalize on opportunities, and advance the state of the art in ML model optimization across applications and domains.

6 Conclusion

In summary, this paper explored a variety of aspects of ML optimization, from past developments to present difficulties and promising areas for future research. The importance of optimization in ML and its critical role in training models to achieve high accuracy, efficiency, and generalization capabilities was among the initial issues we discussed. We surveyed optimization strategies and classified them according to goals including speed, accuracy, and generalization, which provided insight into widely used algorithms and how they are applied in various ML scenarios. Based on a review of the literature, we identified important publications and contributions by well-known scholars that shed light on algorithmic developments, theoretical underpinnings, and real-world optimization applications. We identified typical optimization obstacles, such as the non-convex nature of objective functions, convergence problems, scalability issues, and concerns about robustness, generalization, and computational complexity. We concluded by presenting several proposals for potential fields of research in ML optimization. These include studying novel optimization algorithms inspired by biological and social systems, integrating domain knowledge and priors into optimization processes, creating adaptive and self-tuning optimization techniques, addressing scalability challenges through parallel and distributed optimization, and improving the interpretability and explainability of optimization results. In the final analysis, the work reported here emphasizes the significance of continuous research and development in ML optimization. By tackling present issues and investigating novel optimization paths, researchers can advance the field and develop more effective, reliable, and accessible ML models capable of handling challenging real-world problems.