1 Introduction

The traveling salesman problem (TSP) is the best-known combinatorial optimization problem. It asks for the route of a salesman who starts from a home location, visits each city in a prescribed set exactly once and returns to the starting location, such that the total distance travelled is minimized. The TSP is easy to solve when the number of cities is small, but it becomes very hard as the number of cities grows, since the required computation time increases sharply. Fields where the TSP can be applied effectively include traffic planning and military logistics.

There are two classes of algorithms for solving the TSP: exact algorithms and heuristic algorithms. Exact algorithms always solve an optimization problem to optimality and are guaranteed to terminate with an optimal solution. Heuristics are designed to solve problems faster and more efficiently than exact methods by sacrificing optimality, accuracy, precision or completeness for speed. They are most often employed when approximate solutions are sufficient and exact methods are computationally too expensive. Approximate algorithms for the TSP may be subdivided into three classes: tour construction, tour improvement and hybrid algorithms [1].

Dynamic programming, branch and bound, linear programming and cutting plane methods are exact methods with different complexities for solving the TSP, but none of them is applicable to large-scale problems [2].

Tour construction algorithms iteratively extend a connected partial tour or iteratively combine tour fragments into the best tour. Greedy-type or nearest neighbor construction heuristics start with a randomly chosen vertex and repeatedly connect the current vertex to the unvisited vertex at minimal distance. Although the nearest neighbor heuristic does not find high-quality solutions, it can be combined with perturbative search methods, in which one or more solution components are modified in each step [3–5].

Tour improvement algorithms start with a given tour and replace two or more links of the tour with other links to obtain a shorter tour. The 2-opt, 3-opt and λ-opt local search methods are used as mutation operators in tour improvement algorithms. They remove some edges from the tour and insert new edges if the resulting tour is shorter [3].
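For illustration, a minimal Python sketch of the 2-opt move (not any cited implementation; names are illustrative): a 2-opt step removes two edges and reconnects the tour with the segment between them reversed, keeping the change only if the tour gets shorter.

```python
def two_opt(tour, W):
    """First-improvement 2-opt: reverse a segment whenever doing so shortens the tour.

    tour -- list of city indices forming a closed tour
    W    -- distance matrix, W[i][j] = distance between cities i and j
    """
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges are adjacent in the cycle
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                # Replace edges (a, b) and (c, d) by (a, c) and (b, d).
                if W[a][c] + W[b][d] < W[a][b] + W[c][d]:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```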

Hybrid algorithms include both construction and improvement modes. Population-based heuristics such as ant colony optimization (ACO) algorithms [6–11], neural networks [12–16], evolutionary algorithms [17–19], simulated annealing [20–24], tabu search [21, 23, 25, 26], the artificial bee colony algorithm [9, 27, 28] and particle swarm optimization [20, 29] are also used to solve combinatorial problems by employing tour construction and tour improvement operators.

The ACO algorithm for the traveling salesman problem was introduced by Dorigo and Gambardella, based on the fact that ants are able to find the shortest route between their nest and a source of food. A major deficiency of their algorithm is premature stagnation of the search. They later used a different state transition rule and added a local pheromone updating rule to improve the performance of the primary ACO [6–8].

Dong et al. [10] presented a hybrid approach, called the cooperative genetic ant system (CGAS), which combines GA and ACO for solving the TSP. Information is exchanged between ACO and GA at the end of each iteration to ensure the selection of the best solutions for the next iteration. Through this cooperation, the algorithm has a better chance of reaching the global optimum. The experimental results show that CGAS outperforms other GA/ACO algorithms on the tested TSP instances.

Gündüz et al. [9] presented a new hierarchic method based on the ACO algorithm and the artificial bee colony (ABC) algorithm, in which ACO is used for path construction and ABC is used for path improvement. The hierarchic method was proposed to achieve a good solution in a reasonable time. The experimental results show that the ACO-ABC algorithm performs better than the individual ACO and ABC approaches.

Yong [11] presented a hybrid max–min ant system (HMMA) with a local search algorithm. First, the max–min ant system is used to find an approximate solution. Afterward, the local paths of four adjacent vertices in the approximate solution are converted into locally optimal paths using the four vertices and three lines inequality, yielding a better approximate solution. HMMA was evaluated on small- and large-scale TSP instances.

GA is a computational model inspired by natural evolution and has been applied to a large number of real-world problems, including finding approximate solutions to the TSP. The high adaptability and generalizing ability of GA allow it to tackle the TSP with a simple structure [17–19].

The combination of local search heuristics and genetic algorithms is a capable approach for finding near-optimum solutions to the TSP. Merz and Freisleben [30] offered an approach in which local search techniques are used to find local optima in a given TSP search space, and genetic algorithms are used to search the space of local optima in order to find the global optimum. They introduced new genetic operators to realize the proposed approach and discussed the quality and efficiency of the solutions obtained for a set of symmetric and asymmetric TSP instances.

White and Yen [31] detail the development of a hybrid evolutionary algorithm for solving the TSP. The strategy of the algorithm is to extend the successful results of a genetic algorithm using a distance preserving crossover (DPX) by including memory, in the form of ant pheromone, in the city selection process. The synergistic combination of the DPX-GA with city selection based on a probability determined by both distance and previous success injects additional information into the search mechanism. This hybrid GA finds quality solutions for TSP problems with lower computational complexity.

Machado and Lopes [32] proposed a hybrid approach that joins PSO, genetic algorithms and fast local search for the TSP. The positions of the particles represent TSP tours as permutations of |N| cities. The value assigned to each particle (fitness) is the ratio between a constant Dmin and the cost of the tour represented by the particle’s position.

Yang et al. [33] used the generalized chromosome genetic algorithm (GCGA), which they had previously proposed for solving generalized TSPs (GTSPs), to solve the classical TSP. Numerical experiments show the advantages of the GCGA for solving large-scale TSPs.

Chen and Chien [20] presented a new method, called the genetic simulated annealing ant colony system with particle swarm optimization techniques (GSAP), for solving the traveling salesman problem. The ant colony system generates the initial solutions for the genetic algorithm; genetic simulated annealing techniques then produce better solutions based on these initial solutions. After a predefined number of cycles, the system uses particle swarm optimization techniques to exchange pheromone information between groups.

Wang [34] proposed a new hybrid genetic algorithm with two local optimization strategies (HGA) to solve the symmetric TSP (STSP). Two local search methods are applied to improve the genetic algorithm. The first is a four vertices and three lines inequality applied to local Hamiltonian paths to generate shorter tours. The second reverses local Hamiltonian paths with more than two vertices, which also generates shorter tours.

Deng et al. [35] offered an approach in which a new initial population strategy based on the k-means algorithm (KIP) is used to improve the genetic algorithm for solving the TSP. The results show that, in the same running time, KIP decreases the best error value of the random and greedy initial population strategies by approximately 29.15–37.87% and the average error value by approximately 25.16–34.39%.

In addition, other intelligent algorithms have been combined with GA to improve its performance. Lima et al. [36] used a reinforcement learning technique, the Q-learning algorithm, for the constructive phase of the GRASP metaheuristic and to generate the initial population of a genetic algorithm. The proposed methods were applied to the symmetric traveling salesman problem.

Liu and Zeng [37] proposed a new genetic algorithm with reinforcement learning (RMGA) to solve the TSP. Their method uses reinforcement mutation to improve the GA. The experimental results show that RMGA outperforms the well-known EAX-GA and LKH in both solution quality and running time.

Santos et al. [38] proposed a parallel hybrid GRASP/GA using reinforcement learning to solve the symmetric TSP. They improved the performance of the metaheuristic by using multiple search trajectories that act competitively and/or cooperatively; reinforcement learning guides the metaheuristic toward regions of promising solutions by acquiring information about the problem.

In this paper, we propose a hybrid algorithm that combines a GA with a novel crossover operator, the smart multipoint crossover (SMX), as tour improvement heuristic and multiagent reinforcement learning (MARL) [39] as tour construction heuristic to solve the traveling salesman problem. The initial population of the GA is generated by the MARL heuristic, and these tours are then improved by the GA in order to achieve better solutions based on the given initial solutions.

The paper is organized as follows. After this introduction, Sect. 2 presents the necessary preliminaries about the TSP and GA. Section 3 gives details of the MARL heuristic for the TSP. Section 4 describes the proposed hybrid algorithm. Section 5 presents the experimental results and their comparison with other heuristic methods. Finally, Sect. 6 concludes the paper.

2 Preliminaries

2.1 Traveling salesman problem (TSP)

The TSP can be defined over a set \(N\) of nodes representing the cities and a set \(E\) of edges that fully connect the nodes, where \(W(i, j)\) is the distance between cities \(i\) and \(j\) (the length of edge \((i, j) \in E\)), with \(i, j \in N\). The TSP is the problem of finding a minimal length Hamiltonian tour on the undirected graph \(G = (N, E)\), where a Hamiltonian tour of \(G\) is a closed tour that visits each of the \(n = |N|\) nodes of \(G\) exactly once, and whose length is the sum of the lengths of all the edges of which it is composed [40]. The aim of solving the TSP is to minimize the total closed tour length \(f\), defined in Eq. (1).

$$f = \sum_{i=1}^{n-1} W(i, i+1) + W(n, 1)$$
(1)

where \(n\) is the total number of cities.
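As a concrete reading of Eq. (1), a minimal Python sketch (illustrative only, assuming the distances are given as a matrix indexed by tour order):

```python
def tour_length(tour, W):
    """Length of the closed tour (Eq. 1): consecutive legs plus the closing edge."""
    n = len(tour)
    return sum(W[tour[i]][tour[i + 1]] for i in range(n - 1)) + W[tour[n - 1]][tour[0]]
```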

2.2 Genetic algorithm

The genetic algorithm (GA) is a heuristic search algorithm that has been used to solve search and optimization problems. It is based on the principle of “survival of the fittest” formulated by Charles Darwin and simulates the natural evolution of living beings [41].

A GA begins with a set of problem solutions encoded into a population; a fitness function evaluates the fitness of each individual, and a new generation is then created through selection, crossover and mutation. When the genetic algorithm terminates, a near-optimal solution is obtained. If the termination condition is not satisfied, the algorithm continues with the new population. The basic steps of the genetic algorithm used here are given below:

Initialization

An initial population is generated from many individual solutions. The population size depends on the problem and may contain several hundreds or thousands of possible solutions. The population is usually generated randomly from the search space, which contains all possible solutions; alternatively, solutions may be seeded in areas where optimal solutions are likely to be found. For \(N\) cities, \(N!\) possible solutions can be formed.

Encoding scheme

Each city is represented by a number; for example, with 10 cities, the cities are numbered 1 to 10, and a tour such as 9 1 10 2 4 3 6 7 5 8 contains each number exactly once. For \(N\) cities, a tour is represented by a permutation of the integers from 1 to \(N\).
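A minimal sketch of seeding a random population under this encoding (illustrative names; the proposed algorithm instead seeds its population with MARL, see Sect. 4.2):

```python
import random

def random_population(n_cities, pop_size):
    """Generate pop_size random tours, each a permutation of the cities 1..n_cities."""
    population = []
    for _ in range(pop_size):
        tour = list(range(1, n_cities + 1))
        random.shuffle(tour)  # uniform random permutation
        population.append(tour)
    return population
```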

Selection

This operator selects the fitter chromosomes from the population for the next breeding step. The selected chromosomes are called parent chromosomes; chromosomes with larger fitness values are more likely to be selected.

Crossover operator

Crossover operators are the backbone of the genetic algorithm. Reproduction makes clones of good strings but does not create new ones; crossover operators are applied to the mating pool in the hope of creating better offspring. There is a variety of crossover and mutation operators, which differ from each other in aspects such as having fixed or variable length, being ordered or not, and gene repetition. Among them are partially mapped crossover (PMX), order crossover (OX), cycle crossover (CX), edge recombination (ER) and multipoint crossover (MX).

  • PMX method: PMX builds an offspring by choosing a subsequence of a combination from one parent and preserving the order and position of as many parameters as possible from the other parent [42]. A subsequence is selected by choosing two random cut points, which serve as boundaries for the swapping operations. PMX exploits similarities in value and ordering simultaneously when used with an appropriate reproductive plan. A disadvantage of this crossover is that ties can occur in the mapping.

  • OX method: OX builds offspring by choosing a subsequence of a combination from one parent and preserving the relative order of parameters from the other parent [43]; a sketch is given after this list.

  • CX method: CX builds offspring in such a way that each parameter (and its position) comes from one of the parents [44]. CX preserves the absolute position of the elements in the parent sequence.

  • ER method: ER transfers more than 95% of the edges from the parents to the single offspring [45]. ER had the fastest convergence and converged on better solutions than other operators when optimizing a number of combinatorial problems with 30–75 parameters.

  • MX method: In the MX method, N cutoff points between 1 and the chromosome length are randomly selected to divide each parent chromosome into N + 1 parts; the odd parts of the first parent and the even parts of the second parent generate the first offspring, and the odd parts of the second parent and the even parts of the first parent generate the second offspring [46].
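As a concrete instance of the operators above, a Python sketch of the classical order crossover (OX); this is a textbook version under our assumptions, not necessarily identical to the variant of [43]:

```python
import random

def order_crossover(p1, p2):
    """OX: keep a random slice of p1; fill the rest with p2's cities in p2's order."""
    n = len(p1)
    c1, c2 = sorted(random.sample(range(n), 2))  # two random cut points
    child = [None] * n
    child[c1:c2] = p1[c1:c2]                     # subsequence copied from parent 1
    used = set(p1[c1:c2])
    # Cities of parent 2 in circular order starting after the second cut point.
    fill = [p2[(c2 + i) % n] for i in range(n) if p2[(c2 + i) % n] not in used]
    # Empty slots of the child, also walked circularly from the second cut point.
    slots = [(c2 + i) % n for i in range(n) if child[(c2 + i) % n] is None]
    for pos, city in zip(slots, fill):
        child[pos] = city
    return child
```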

Mutation operator

The mutation operator enhances the ability of the GA to find a near-optimal solution by maintaining a sufficient level of genetic variety in the population, which is needed to make sure that the entire solution space is covered by the search. In this sense, the operation prevents premature convergence and helps the algorithm escape local optima. The mutation operator generates a new offspring by randomly swapping genes; the mutation probability is a parameter of the genetic algorithm.

Evaluation

After the crossover and mutation operations, the system evaluates the fitness value of each chromosome using the fitness function.

Termination

The termination criterion can be the maximum number of iterations, the computation time or the number of iterations without improvement. If the termination criterion is reached, the chromosome with the highest fitness value is returned as the best solution; otherwise, the system continues.

3 MARL heuristic for TSP

In this work, we use the MARL heuristic [39] to construct the primary tour of the TSP; 2-opt and nearest insertion into the convex hull local search (NICH-LS) [47] then improve the given primary tour.

The MARL approach for solving the TSP starts by creating a graph of the TSP instance and placing \(m\) cooperative agents on that graph. Each agent is a traveling salesman whose tour is constructed by incrementally selecting cities (nodes or states) until all cities have been visited and the agent returns to the first city of the tour. The MARL model consists of the learning environment, the learning algorithm and a reward function that evaluates the effectiveness of the agents’ learning [48, 49].

For each agent, the following scenario occurs. Initially, the agent is randomly placed at one of the graph’s nodes, which becomes the first city of its tour. The candidate list of the current city plays the role of the agent’s action set; candidate cities are those not yet visited. The agent chooses one of its actions according to one of the selection methods and then moves to the selected city. This process of selecting an action and moving to the chosen city is repeated until a tour is created, that is, until every node of the graph has been visited and the agent has returned to the starting city (or making a feasible tour becomes impossible).

An iteration is finished once the \(m\) agents have created their own tours. The best of the tours created by the \(m\) agents in the current iteration is then selected, and the environment uses the length of this tour to produce its response. If the response is favorable, the selected actions of all states along the traversed path are rewarded according to the Q-learning algorithm. This update encourages the use of shorter routes and increases the probability that future routes will use the links along the best solutions. The process is repeated for a predetermined number of iterations, and after each iteration, every tour created by the \(m\) agents that is shorter than the best tour of the previous iterations is inserted into the candidate solutions set, which serves as the initial population of the next stage. Before being added to the candidate solutions set, these tours are optimized using 2-opt and NICH-LS.
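A schematic of this reward step in Python, assuming a standard one-step Q-learning update; the exact update rule and reward shaping of [39] may differ, and all names here are illustrative:

```python
def reinforce_best_tour(Q, tour, tour_len, alpha=0.75, gamma=0.9, rd=1.0):
    """Reward the state-action pairs along the best tour of the iteration.

    Q        -- Q[s][a], the learned value of moving from city s to city a
    tour_len -- length of the best tour of this iteration
    rd       -- reward parameter of Sect. 5 (scaling by tour length is an assumption)
    """
    reward = rd / tour_len                       # shorter tours earn larger rewards
    n = len(tour)
    for k in range(n):
        s, a = tour[k], tour[(k + 1) % n]        # edge (s, a) of the closed tour
        best_next = max(Q[a].values())           # greedy value estimate of next state
        Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
```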

NICH-LS improves the given tour by locally manipulating the order of nodes in partial tours: it creates the convex hull of the nodes in a partial tour and adds the remaining nodes using the nearest insertion method. If the manipulated partial tour is shorter than the primary partial tour, the partial tour is replaced, which reduces the length of the overall tour. The outline of the MARL algorithm is shown in Fig. 1.

Fig. 1 Modified MARL algorithm

4 Proposed hybrid algorithm (GA + MARL)

The suggested algorithm combines the advantages of MARL and GA: the MARL heuristic drives the exploration of the search space, focusing on the global optimization task and producing an acceptable solution, and this solution is then optimized by the GA improvement heuristic, which visits the promising subregions of the solution space. Thus, MARL is used as the construction heuristic, and some of the best solutions obtained by MARL are given to the GA as its initial population.

In this section, we propose an improvement heuristic based on GA that improves the tours in the candidate solutions set, taken from MARL + NICH-LS, in order to increase solution quality.

4.1 Chromosome representation

The solutions of the TSP (tours) can be represented by chromosomes consisting of genes, where each gene indicates a city. The values of these genes and their positions in the gene string tell the genetic algorithm what solution the individual represents. The TSP has been solved by GA under different chromosome representations, such as binary strings and matrices, which usually use binary alphabets for the tour representation. The path representation is the most natural representation of a tour: a path is a list of \(n\) cities, and if city \(x\) is the y-th element of the list, then city \(x\) is the y-th city to be visited. For example, the tour \(7 - 3 - 5 - 6 - 4 - 2 - 1\) is represented directly as the list \(7 - 3 - 5 - 6 - 4 - 2 - 1\). In this paper, we use the path representation for encoding the chromosomes.

4.2 Population initialization

As mentioned earlier, in the proposed hybrid algorithm, the initial population is taken from the MARL heuristic [39]. In other words, some of the best solutions obtained by MARL (the tours in the candidate solutions set) are given to the GA as the initial population.

4.3 Evaluation of fitness function

Since GA is generally applied to maximization problems and the TSP is a minimization problem, the inverse of the tour length is taken as the fitness function.

4.4 Selection

In the selection step, parents are selected for crossover. Here, the roulette wheel selection method is used.
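A sketch of roulette wheel selection with the inverse-length fitness of Sect. 4.3 (illustrative; tour_length refers to the Eq. (1) sketch from Sect. 2.1):

```python
import random

def roulette_select(population, W):
    """Pick one parent with probability proportional to fitness = 1 / tour length."""
    fitness = [1.0 / tour_length(t, W) for t in population]
    r = random.uniform(0, sum(fitness))
    acc = 0.0
    for tour, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return tour
    return population[-1]  # guard against floating-point round-off
```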

4.5 Crossover operator

The main purpose of this component is to create offspring from a pair of solutions chosen in the selection step. This paper proposes a new crossover operator; details of the proposed operator and the hybrid GA are given in Sect. 4.8.

4.6 Mutation operator

The mutation operator is applied to the new chromosomes after crossover. It preserves the genetic diversity of the chromosomes in each generation of the population. In this paper, the exchange mutation operator (EM) is used: EM randomly selects two cities in a tour and exchanges them. The mutation probability \({\text{PM}}\) is set to 0.1.
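A minimal sketch of EM with PM = 0.1 (illustrative names):

```python
import random

def exchange_mutation(tour, pm=0.1):
    """With probability pm, swap two randomly chosen cities of the tour."""
    if random.random() < pm:
        i, j = random.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]
    return tour
```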

4.7 Termination criteria

In the proposed approach, the maximum number of iterations (\({\text{MNSMX}}\)) is used as the termination criterion. After the termination criterion is reached, the chromosome with the highest fitness value is returned as the best solution.

4.8 Proposed smart multipoint crossover operator

The proposed SMX receives two individuals (parent chromosomes) to be recombined and returns a new individual (offspring chromosome) created by recombining the parent chromosomes. The outline of SMX is shown in Fig. 2.

Fig. 2 Outline of SMX

The recombination of genes occurs as follows: specific sequences of genes from both parent chromosomes are inserted alternately into the offspring. To illustrate, Fig. 3 presents two individuals to be recombined, Parent1 and Parent2; a code sketch of the stages is given after the list below.

Fig. 3 Recombination of two individuals using the SMX operator

  • At stage II, the random number m = 2 is generated; therefore, Parent1’s genes belonging to sequence 1–2 (the two genes of Parent1 from position pnp to pnp + 1) are inserted into the offspring (genes 2, 5).

  • At stage III, the parent is Parent2 and pno = 3; the random number m = 4 is generated; the position of the last inserted gene of the offspring (gene 5) is 4 in Parent2; therefore, pnp = 5, and the four genes of Parent2 after position 4 that do not yet exist in the offspring are inserted circularly into the offspring (genes 3, 9, 4, 7).

  • At stage IV, the parent is Parent1 and pno = 7; the random number m = 1 is generated; the position of the last inserted gene of the offspring (gene 7) is 7 in Parent1; therefore, pnp = 8, and the one gene of Parent1 after position 7 that does not yet exist in the offspring is inserted circularly into the offspring (gene 1).

  • At stage V, the parent is Parent2 and pno = 8; the random number m = 3 is generated; the position of the last inserted gene of the offspring (gene 1) is 2 in Parent2; therefore, pnp = 3, and at most three genes of Parent2 after position 2 that do not yet exist in the offspring are inserted circularly into the offspring (genes 6, 8).
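The following Python sketch reflects our reading of stages II–V above; the exact rules for choosing m and the first gene follow Fig. 2, and may differ in detail, so this is an illustration rather than the definitive SMX:

```python
import random

def smx(p1, p2, m_max=4):
    """Sketch of SMX: the parents alternately contribute short circular runs of
    not-yet-used genes, starting after the position of the last inserted gene."""
    n = len(p1)
    parents = [p1, p2]
    offspring, used = [], set()
    turn = 0                                    # stage II starts from Parent1
    while len(offspring) < n:
        parent = parents[turn]
        m = random.randint(1, m_max)            # random run length for this stage
        if not offspring:                       # first stage: random start position
            start = random.randrange(n)
            offspring.append(parent[start])
            used.add(parent[start])
            m -= 1
        else:                                   # locate the last inserted gene
            start = parent.index(offspring[-1])
        taken = 0
        for i in range(1, n):                   # scan circularly after `start`
            if taken == m:
                break
            gene = parent[(start + i) % n]
            if gene not in used:                # insert at most m unused genes
                offspring.append(gene)
                used.add(gene)
                taken += 1
        turn = 1 - turn                         # alternate parents each stage
    return offspring
```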

If the length of the generated offspring is smaller than the length of the best individual (the individual with the smallest length in the population), this favorable offspring is first optimized using 2-opt and NICH-LS and then added to the population pool; otherwise, it is discarded. This process continues for a predefined number of iterations (\({\text{MNSMX}}\)). When the stopping criterion is satisfied, the GA improvement heuristic stops and reports the best individual, which is the best tour created by the heuristic. The outline of our GA improvement heuristic for the TSP is shown in Fig. 4, and the working diagram of the proposed hybrid algorithm is depicted in Fig. 5.

Fig. 4 Outline of the GA improvement heuristic for TSP

Fig. 5 Working diagram of the proposed hybrid algorithm (GA + MARL)

5 Experimental results

In this section, we report the experimental results obtained with the proposed hybrid algorithm on 34 standard benchmark instances (datasets) from TSPLIB [50]. Since the proposed algorithm has a stochastic component, it may produce different solutions over multiple runs on the same dataset. Therefore, each experiment consists of 20 independent runs; the best, worst and average tour lengths were recorded, and the presented results are averages over the 20 runs for each dataset.

The proposed algorithm was implemented using Microsoft Visual Studio .NET 2013 under the 64-bit Windows 8.1 operating system. Experiments were conducted on a laptop with an Intel Core i5-4200U 1.6 GHz CPU and 4 GB of RAM. The parameter values of the proposed algorithm were selected based on preliminary trials; the selected values are those that gave the best results with respect to both solution quality and the computational time needed to reach that quality. Table 1 shows the parameter settings of the proposed method.

Table 1 Parameter settings for hybrid algorithm

When developing autonomous learning agents, performance depends crucially on the selection of reasonable learning parameters, for example, learning rates or exploration parameters [51]. In other words, successful reinforcement learning depends heavily on careful parameter settings; generally speaking, all learning parameters must be carefully tuned in advance to elicit good performance [52].

Figure 6 shows the results of running the MARL algorithm on dataset eil101 with different values of the learning parameters. For each experiment, a run of 10,000 iterations (\({\text{MNLI}}\)) was performed, and the algorithm was executed 20 times independently.

Fig. 6 Results of running the MARL algorithm on dataset eil101 with different values of a the learning rate (α), b the reward (\({\text{rd}}\)), c beta (\(\beta\)) and d the ε-greedy action selection probability (\({\text{EG}}\))

Figure 6a shows the results obtained by the MARL algorithm for different learning rates (α ∈ [0,1]). A small value of α means the learning process proceeds slowly, while a too high value of α can harm the convergence of the algorithm, so an appropriate α is very important to the performance of the whole algorithm. The results show that the algorithm performs better for a particular range of learning rates: performance (average tour length) improves as α increases toward 0.75 but degrades gradually when α is increased beyond 0.75. Therefore, we use a learning rate of 0.75 in the proposed algorithm.

In Fig. 6b, the algorithm shows satisfactory results when the reward (\({\text{rd}}\)) is around 1. When \({\text{rd}}\) is less than 1 and approaches zero, the performance of the algorithm drops significantly, and increasing \({\text{rd}}\) above 1 also has a somewhat negative impact. Therefore, we use a reward of 1 in the proposed algorithm.

Beta (β) is a parameter that determines the relative importance of the heuristic value \(W^{-1}(i, j)\) (where \(W(i, j)\) is the distance between cities i and j) versus the Q-values. If β = 0, only the Q-value affects \(P(a|sc)\) (the probability with which the agent at city sc selects action a); as β increases, we favor the choice of shorter edges. However, increasing β beyond a certain value gradually neglects the importance of the Q-values, which can harm the learning process. According to Fig. 6c, the MARL algorithm shows satisfactory results when β is around 2, so we set β to 2.

In the ε-greedy action selection method, a greedy action is selected most of the time (one of the learned optimal actions is selected greedily with respect to the Q-values), and with a small probability a random action is chosen. This ensures that, after many learning episodes, all possible actions have been tried many times, leading to an optimal policy. The softmax action selection method differs from ε-greedy in the way the random action is selected: a weight is assigned to each action depending on its estimated value, and a random action is selected based on these weights, ensuring that the worst actions are unlikely to be chosen [53]. In the proposed algorithm, each agent chooses its actions using ε-greedy selection with probability \({\text{EG}}\) and softmax selection with probability \({\text{SM}}\) (equal to 1 − \({\text{EG}}\)). As shown in Fig. 6d, the algorithm gives satisfactory results when \({\text{EG}}\) is between about 0.8 and 0.9, so we set \({\text{EG}}\) to 0.85.
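A Python sketch of this mixed selection rule, assuming textbook ε-greedy and softmax forms; the exploration probability epsilon and temperature tau are illustrative assumptions, and [39] may weight actions differently (e.g., mixing in the β-weighted heuristic term described above):

```python
import math
import random

def select_action(q_values, eg=0.85, epsilon=0.1, tau=1.0):
    """Choose the next city: ε-greedy with probability eg, softmax otherwise.

    q_values -- dict mapping each candidate (unvisited) city to its Q-value
    """
    actions = list(q_values)
    if random.random() < eg:
        # ε-greedy: exploit the best Q-value, explore with small probability.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q_values[a])
    # Softmax: weight each action by exp(Q / tau), so bad actions stay unlikely.
    weights = [math.exp(q_values[a] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```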

To see the effect of the maximum number of learning iterations (\({\text{MNLI}}\)) and the maximum number of SMX iterations (\({\text{MNSMX}}\)) on convergence for dataset eil101, consider Fig. 7, which contains convergence graphs of the proposed algorithm on this dataset. In each graph, average running times in seconds are shown above the data labels.

Fig. 7 Convergence of the proposed algorithm on dataset eil101 with different values of a \({\text{MNLI}}\) and b \({\text{MNSMX}}\)

Figure 7a visualizes the effect of different values of the parameter \({\text{MNLI}}\). After the 1000th iteration, the results generally show no notable improvement in average tour length relative to the sharp increase in the running time of the algorithm; in other words, none of the runs converged appreciably further after the 1000th learning iteration despite the intense growth in learning cycles. For this reason, in the rest of the paper we set \({\text{MNLI}}\) to 1000.

From Fig. 7b, we observe that the convergence rate of the proposed algorithm clearly drops after about iteration 40,000; in other words, none of the runs converged significantly further after the 40,000th SMX iteration (\(\left( {100 \times \log_{10} n} \right)^{2}\) for this dataset). For this reason, in the rest of the paper we set \({\text{MNSMX}}\) to \(\left( {100 \times \log_{10} n} \right)^{2}\); after this many cycles, the GA improvement heuristic is stopped.

Table 2 summarizes the experimental results of the different stages of our algorithm on 34 TSP datasets from TSPLIB. The first column shows the name of each dataset with the optimal solution length from TSPLIB in parentheses, and the column “Method (stage)” identifies the stage of the proposed hybrid algorithm. For example, the average solution found by the MARL + NICH-LS stage on dataset eil101 is 645.47, which takes 0.76 s and has a \({\text{PD}}_{\text{avg}}\) of 2.62, while the average solution found by the full hybrid algorithm, GA-MARL + NICH-LS (after applying the GA improvement to the candidate solutions taken from MARL + NICH-LS), is 642.6, which takes 1.08 s (the total time needed for MARL + NICH-LS + GA) and has a \({\text{PD}}_{\text{avg}}\) of 2.16. The column “C.S.S./P. size” shows the average size of the candidate solutions set and population, produced by MARL + NICH-LS and GA, respectively. The column “best” shows the length of the best solution found by each algorithm, “average” the average solution length, “worst” the length of the worst solution and “CPU time(s)” the average running time in seconds. The column “\({\text{PD}}_{\text{best}}\)” gives the percentage deviation of the best solution length from the optimal solution length, and “\({\text{PD}}_{\text{avg}}\)” the percentage deviation of the average solution length from the optimal solution length. \({\text{PD}}_{\text{best}}\) and \({\text{PD}}_{\text{avg}}\) are defined in Eq. (2).

Table 2 Computational results of the proposed hybrid algorithm’s stages for 34 TSP datasets
$$\text{PD}_{\text{best}} = \frac{\text{best solution} - \text{best known}}{\text{best known}} \times 100, \qquad \text{PD}_{\text{avg}} = \frac{\text{average solution} - \text{best known}}{\text{best known}} \times 100$$
(2)
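In code form, Eq. (2) reads:

```python
def percent_deviation(solution_length, best_known):
    """Percentage deviation of a solution length over the best known length (Eq. 2)."""
    return (solution_length - best_known) / best_known * 100.0
```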

The average \({\text{PD}}_{\text{best}}\) values for MARL + NICH-LS and GA-MARL + NICH-LS are 0.99 and 0.65%, respectively; the average \({\text{PD}}_{\text{avg}}\) values are 2.00 and 1.54%, and the average running times are 23.00 and 29.71 s. According to the \({\text{PD}}_{\text{best}}\) of GA-MARL + NICH-LS, 85.29% of the \({\text{PD}}_{\text{best}}\) values are less than 1%, which means that for these datasets the best solution found over the 20 trials deviates by less than 1% from the best known solution.

Figure 8 visualizes the \({\text{PD}}_{\text{best}}\), \({\text{PD}}_{\text{avg}}\) and running times of these two stages according to Table 2. The results presented in Table 2 and Fig. 8 show that the accuracy of GA-MARL + NICH-LS is quite promising: it provides good results in reasonable time for both small and large datasets, although its running times are slightly greater than those of MARL + NICH-LS.

Fig. 8 Graphical comparison between MARL + NICH-LS and GA-MARL + NICH-LS based on a \({\text{PD}}_{\text{best}}\), b \({\text{PD}}_{\text{avg}}\) and c CPU time(s)

To compare the proposed algorithm (GA-MARL + NICH-LS) with the counterpart algorithms [15, 16, 20, 22, 33–36, 38] in terms of CPU time, we scale the CPU time of each algorithm by a scaling coefficient related to its processing system [22]. The processing systems, programming languages and scaling coefficients are shown in Table 3.

Table 3 Scaling coefficients for adjusting the CPU time of the counterpart algorithms with respect to the CPU time of the GA-MARL + NICH-LS

We compare the experimental results of our algorithm with nine state-of-the-art algorithms: the generalized chromosome genetic algorithm (GCGA) [33], the adaptive simulated annealing algorithm with greedy search (ASA-GS) [22], the self-organizing neural network (RABNET-TSP) [15], the memetic neural network (Memetic-SOM) [16], the genetic simulated annealing ant colony system with particle swarm optimization techniques (GSAP) [20], the Q-learning algorithm for initialization of the GRASP metaheuristic and genetic algorithm (GA-GRASP-Q-lrn) [36], the parallel hybrid implementation using genetic algorithm, GRASP and reinforcement learning (HPM) [38], the hybrid genetic algorithm with two local optimization strategies (HGA) [34] and the improved genetic algorithm with initial population strategy (KIP) [35]. The results are shown in Table 4; the meanings of the columns are the same as in Table 2 (time is in seconds and scaled according to Table 3), the best results are given in bold, and the averages of \({\text{PD}}_{\text{best}}\), \({\text{PD}}_{\text{avg}}\) and time for each algorithm are given in italics (last column, Avg.). More intuitive comparisons are shown in Figs. 9, 10 and 11.

Table 4 Results of the proposed algorithm compared with those of nine state-of-the-art algorithms
Fig. 9 Graphical comparison between the \({\text{PD}}_{\text{best}}\) of GA-MARL + NICH-LS and eight algorithms

Fig. 10 Graphical comparison between GA-MARL + NICH-LS and eight algorithms in terms of CPU time

Fig. 11 Comparison of the proposed algorithm with each of the nine algorithms based on a average \({\text{PD}}_{\text{best}}\), b average \({\text{PD}}_{\text{avg}}\) and c average CPU time(s) on the same datasets

With respect to Table 4 and Figs. 9 and 10, when compared with the GCGA, the RABNET-TSP, the Memetic-SOM, the GA-GRASP-Q-lrn, the HPM and the KIP, our hybrid algorithm not only found the known optimal solutions that the other algorithms found, but also found optimal solutions that the others failed to find. Moreover, the averages of the results obtained by our algorithm in terms of \({\text{PD}}_{\text{best}}\) and \({\text{PD}}_{\text{avg}}\) are usually better than those of these algorithms. Compared with the ASA-GS, the GSAP and the HGA, for some instances the best solutions and averages obtained by those algorithms are slightly better than ours, and for other instances our results are slightly better. Overall, the performance of our algorithm is much better than that of the GCGA, the RABNET-TSP, the Memetic-SOM, the GA-GRASP-Q-lrn, the HPM and the KIP, and roughly equal to that of the ASA-GS, the GSAP and the HGA in terms of \({\text{PD}}_{\text{best}}\) and \({\text{PD}}_{\text{avg}}\).

Finally, we considered the average \({\text{PD}}_{\text{best}}\), \({\text{PD}}_{\text{avg}}\) and running times of the proposed algorithm and the counterpart algorithms on the same datasets, as shown in Fig. 11. The average \({\text{PD}}_{\text{best}}\) and running time of our algorithm are lower than those of all nine algorithms; thus, our algorithm performs better in terms of \({\text{PD}}_{\text{best}}\), and its convergence is faster than that of all nine algorithms. The average \({\text{PD}}_{\text{avg}}\) of our algorithm on the same datasets is lower than those of the GCGA, the RABNET-TSP, the Memetic-SOM, the GSAP, the GA-GRASP-Q-lrn, the HPM and the KIP, and somewhat higher than those of the ASA-GS and the HGA. Overall, our approach is superior to most of the nine state-of-the-art methods, and its computation speed is considerably fast.

6 Conclusion

In this study, a hybrid algorithm for finding high-quality solutions to the TSP is presented. The proposed method uses the MARL approach as tour construction heuristic, which generates the initial population of the GA, and a GA with the new crossover operator SMX is applied as tour improvement heuristic.

Our algorithm was tested on 34 symmetric TSP datasets ranging from 29 to 7397 cities. The results indicate that the proposed algorithm performs well with respect to both solution quality and computation speed, verifying its validity. The performance of GA-MARL + NICH-LS was also compared with nine state-of-the-art algorithms: the GCGA, the ASA-GS, the RABNET-TSP, the Memetic-SOM, the GSAP, the GA-GRASP-Q-lrn, the HPM, the HGA and the KIP. GA-MARL + NICH-LS outperforms these algorithms in terms of accuracy and/or CPU time, providing a better compromise between CPU time and solution quality.