1 Introduction

The Travelling Salesman Problem (TSP) stands out as one of the most widely studied combinatorial optimization problems. It is a challenging problem that belongs to the class of NP-complete problems [1,2,3]. The problem statement is simple and easy to understand, but the solution is difficult to obtain, since, in the context of optimization problems, an optimal solution is required. Solving small data instances does not need substantial time, but the difficulty arises when solving large data instances of the TSP. Accordingly, several heuristic and metaheuristic algorithms have been investigated to solve this problem and obtain approximate solutions instead of an optimal one. Recently, Pandiri and Singh [4] presented two swarm intelligence approaches for solving multi-depot salesmen problems with load balancing: the first is based on the artificial bee colony algorithm, and the second on an invasive weed optimization algorithm. Also, a metaheuristic method known as feasibility-assured TSP-likened scheduling has recently been presented, in which the scheduling problem is translated into a construction graph, similar to a TSP instance, so that it can benefit from the advantages of the ant colony optimization algorithm [5].

Heuristic algorithms can be categorized into constructive heuristics and improvement heuristics [6]. In the first category, a sub-optimal solution is built by starting with an empty solution and continuously expanding it until a complete solution is acquired. In the second category, local search algorithms start with a sub-optimal solution and then attempt to improve it further, deciding at each step which solution in the neighborhood of the current solution to move to. This process continues until no improvement can be made. Local search algorithms have been used to solve many optimization problems, such as the TSP and multiple sequence alignment tasks. For example, da Silva et al. [7] presented an algorithm called AlineaGA, a genetic algorithm (a metaheuristic approach) with local search optimization, for solving multiple sequence alignment tasks more accurately.

One local search algorithm with significant influence is the 2-opt algorithm [8]. This algorithm tries to improve a sub-optimal solution obtained from any constructive heuristic algorithm. The quality of the improved solutions generated by the 2-opt algorithm can be within 5% above the Held-Karp lower bound [9]. However, as the TSP data instance size grows, the time needed for comparing graph edges in the 2-opt algorithm grows remarkably. Therefore, researchers in this field have tried to accelerate the 2-opt local search algorithm by means of parallelism.

Several approaches have been proposed for parallelizing this algorithm. One of them is the geometric partitioning approach proposed by Karp [10], a recursive scheme that partitions the cities into rectangles based on their x and y coordinates and then assigns each rectangle to a processor, which runs the 2-opt algorithm on the set of cities within that rectangle.

Another approach is based on tour partitioning and was proposed by Allwright and Carpenter [11]. The authors first applied a heuristic algorithm to obtain an initial tour, which is divided into K segments of length \(M/P\), where \(M\) is the length of the initial tour and P is the number of processors. Subsequently, each segment is assigned to a processor, which creates a sub-tour from it by adding an edge between the endpoints of the segment. Afterward, the 2-opt algorithm is applied to improve the sub-tour, which is then combined with the sub-tours from the other processors to form the final improved tour. The drawback of this approach is that the algorithm may halt before reaching the sub-optimal solution. Therefore, Verhoeven et al. [12] modified the earlier approach to guarantee that the tour partitioning converges to a feasible solution. They tested their approach on several data instances from the TSP Library (TSPLIB) [13] using a linear topology network of 512 transputers; the average quality of the obtained solutions was within 8.3% of the optimal solution.

Other approaches parallelized local search algorithms using Graphics Processing Units (GPUs). Rocki and Suda [14] copied the coordinates of each city into shared memory and then assigned one 2-opt algorithm to each GPU thread; the average quality of the obtained solutions was within 8.9% of the optimal solution. O'Neil and Burtscher [15] fully exploited the parallelism of the GPU hardware to accelerate the 2-opt local search algorithm for solving the TSP; the average quality of their obtained solutions was within 8.24% of the optimal solution, as shown by Zhou et al. [16] in their comparative evaluation. Moreover, Qiao and Créput [17] presented a parallel scheme similar to the approach adopted in [12], but on a GPU. Their approach divided the initial tour into disconnected sub-tours and then applied the 2-opt algorithm on each sub-tour; the average quality of the obtained solutions was within 11.4% of the optimal solution.

Despite the acceleration of the 2-opt algorithm achieved by the previous approaches, a noticeable degradation in tour quality was recorded. Therefore, in this paper a repetitive approach is adopted instead of a partitioning approach, in order to gain high speedup in a much simpler way without paying a penalty in solution quality. Furthermore, this approach exploits the iterative structure offered by the Optical Transpose Interconnection System (OTIS), which is constructed from groups of processors connected using optical links, with the processors inside each group connected according to a factor network using electronic links [18]. The factor network that determines how the processors are connected inside each group can be either a basic network, such as the hypercube or mesh [19], or a hybrid network, such as the Hyper Hexa-Cell (HHC) [20] or Mesh of Trees (MOT) [21]. The emergence of these optoelectronic architectures motivates researchers in the parallel processing field to parallelize algorithms for various problems on such architectures.

Various problems have been solved on OTIS optoelectronic architectures. For example, in [22,23,24], sorting, prefix sum, routing, consecutive sum, data accumulation, and matrix multiplication algorithms were applied on OTIS-Mesh. Load balancing and wormhole routing were implemented on OTIS networks [25,26,27]. Routing, sorting, and communication algorithms were implemented on OTIS-HHC [28, 29]. Parallel prefix and parallel enumeration sorting were implemented on OTIS-MOT [30, 31]. A parallel nearest neighbor algorithm for solving the TSP was applied on OTIS-Hypercube and OTIS-Mesh [32]. However, to date, no work has applied a local search algorithm for solving the TSP on OTIS-HHC or OTIS-MOT. Thus, in this paper, the OTIS-HHC and OTIS-MOT optoelectronic architectures [20, 21] are selected to solve the TSP using the Parallel Repetitive 2-opt (PRTO) algorithm. This algorithm is evaluated analytically under the following performance metrics: parallel time complexity, number of communication steps, speedup, efficiency, cost, and communication cost. Its performance is also evaluated by several simulation experiments conducted on both optoelectronic architectures. Moreover, the PRTO algorithm is applied on the factor network of each optoelectronic architecture to show the advantage of each optoelectronic architecture over its factor network. Consequently, OTIS-HHC performance is compared with HHC and OTIS-MOT performance is compared with MOT. Also, a comparison between the PRTO algorithm and the Parallel Repetitive Nearest Neighbor (PRNN) algorithm for solving the TSP is carried out in terms of speedup, efficiency, and solution quality, which shows the superiority of the PRTO algorithm over the PRNN algorithm on the OTIS-HHC and OTIS-MOT optoelectronic architectures. Moreover, a realistic comparison is conducted in terms of speedup and efficiency between the PRTO algorithm on OTIS-HHC and the MAX-MIN Ant System (MMAS) on the Sunway Blue Light supercomputer for solving the TSP, which shows the superiority of the PRTO algorithm over the MMAS algorithm. To the best of our knowledge, this is the first work to parallelize the 2-opt algorithm on optoelectronic architectures, and PRNN is the only heuristic algorithm that has previously been used for solving the TSP on optoelectronic architectures.

The organization of this paper is as follows: Section 2 introduces background on the OTIS optoelectronic architectures, namely OTIS-HHC and OTIS-MOT. Section 3 illustrates the PRTO algorithm for solving the TSP over OTIS-HHC and OTIS-MOT. Section 4 presents an analytical evaluation of this algorithm on OTIS-HHC and OTIS-MOT. Section 5 presents the simulation setup and results. Finally, Section 6 concludes and summarizes the overall work.

2 OTIS optoelectronic architecture

In interconnection networks, the processors are connected in a way that enables them to communicate, and the way they are connected plays an important role in the overall performance of the parallel system. This encourages researchers in the parallel processing field to propose interconnection networks with attractive properties such as low diameter, high bisection width, and favorable minimum and maximum node degrees. In this respect, researchers have proposed hybrid interconnection networks that exploit the best features of two or more interconnection networks. For example, the HHC exploits the low diameter of a hypercube and the static node degree of a ring. Moreover, this hybridization also occurs at the link level: in OTIS optoelectronic architectures there are two types of links, the optical links that connect groups of processors and the electronic links that connect processors within each group. This can lower the design complexity of the interconnection network by applying an iterative structure among processors, such that a higher dimension of the factor network can be simulated using a lower dimension of the OTIS architecture of that factor network. Therefore, two OTIS optoelectronic architectures with hybrid factor networks were chosen to solve the TSP using the PRTO algorithm, namely OTIS-HHC and OTIS-MOT. The following subsections illustrate the structure of OTIS-HHC and OTIS-MOT.

2.1 OTIS-HHC

OTIS-HHC [20] combines the topological structures of OTIS, HHC, and the hypercube. The HHC is constructed by building a hexa-cell of 6 processors as the first dimension, with three more links added to this one-dimensional hexa-cell, as shown in Fig. 1a. The labels of the processors in a one-dimensional HHC are assigned such that each processor in the lower triangle has a label that differs by only one significant bit from its direct neighbor in the upper triangle; therefore, the label \(<011>\) is not used. A two-dimensional HHC is then constructed from two one-dimensional HHCs connected as a hypercube. Each processor in a d-dimensional HHC has a label composed of two parts: the first part is a sub-label that refers to the location of the one-dimensional HHC in its corresponding (d-1)-dimensional hypercube, while the second part is the label of the processor within that one-dimensional HHC. Figure 1b shows the 2D-HHC and its labeling. For example, the label \(<00,110>\) refers to processor six in one-dimensional HHC sub-group 0, and \(<01,110>\) refers to processor six in one-dimensional HHC sub-group 1, as shown in Fig. 1b. Each one-dimensional HHC can be connected to another one-dimensional HHC, where the first one is called sub-group 0 and the second one is called sub-group 1, as depicted in Fig. 1b.

Fig. 1 a One-dimensional HHC. b Two-dimensional HHC [20]

The architecture of OTIS-HHC is based on connecting groups of a d-dimensional HHC via optical links, where each group contains \(P_{G}\) processors. In particular, a d-dimensional OTIS-HHC consists of G groups that are connected via optical links, each composed of a d-dimensional HHC. OTIS-HHC has two architectural structures. In the first structure, the number of groups equals half the number of processors in each group; this case is referred to as \(G=P/2\), where G is the number of groups and P is the number of processors, as depicted in Fig. 2a. In this figure, the one-dimensional OTIS-HHC is composed of 3 groups with 6 processors per group. Note that each group in this one-dimensional OTIS-HHC has two optical links connecting it to the other groups. In the second structure, the number of groups G in OTIS-HHC equals the number of processors P located in each group; this case is referred to as \(G=P\). Figure 2b depicts the \(G=P\) case, where the one-dimensional OTIS-HHC is composed of 6 groups with 6 processors per group. Note that each group in this one-dimensional OTIS-HHC has 5 optical links connecting it to the other groups. The connection between groups is made by transposing the indices of the processor and the group. For example, node \(<1, 5>\) is connected with node \(<5, 1>\) by an optical link; that is, processor five in the first group is connected with processor one in the fifth group by an optical link.
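To make the transpose rule concrete, the following small sketch (an illustrative example of ours; the class and method names are not from the paper) swaps the group and processor indices to obtain the optical neighbor of a node.

```java
/** Illustrative sketch of the OTIS transpose rule described above: node <g, p>,
 *  i.e., processor p in group g, is connected by an optical link to node <p, g>. */
public final class OtisTranspose {

    /** Returns {group, processor} of the node reached through the optical link of <group, processor>. */
    static int[] opticalNeighbor(int group, int processor) {
        return new int[]{processor, group};   // swap the group and processor indices
    }

    public static void main(String[] args) {
        int[] peer = opticalNeighbor(1, 5);   // processor five in group one ...
        System.out.println("<" + peer[0] + ", " + peer[1] + ">");   // ... prints <5, 1>
    }
}
```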

Fig. 2 The architecture of a one-dimensional OTIS-HHC when a G = P/2 and b G = P [20]

2.2 OTIS-MOT

The structure of OTIS-MOT is based on combining the best characteristics of tree and mesh interconnection networks while avoiding their drawbacks; in addition, it inherits the features of optical communication links. OTIS-MOT combines the attractive features of the MOT, namely a small diameter and a large bisection width, with the attractive features of OTIS interconnection networks. An \(n \times n\) OTIS-MOT contains \(n^{2}\) groups, each consisting of \(n \times n\) (i.e., \(P_{G}\)) processors. For example, a \(3\times 3\) OTIS-MOT has 9 groups, each consisting of 9 processors [21]. The OTIS-MOT lattice is divided into rows and columns of groups of processors. Inside each group, the processors are also organized into rows and columns and connected via electronic links in two ways. The first is row-wise, where the processors in each row are connected to form a binary tree whose root is the first processor in the row, such that within group (i, j) the processor at row R and column C is directly connected to the processor at row R and column 2C, and to the processor at row R and column 2C+1, if they exist. The second is column-wise, where the processors in each column are connected to form a binary tree whose root is the first processor in the column, such that within group (i, j) the processor at row R and column C is directly connected to the processor at row 2R and column C, and to the processor at row 2R+1 and column C, if they exist, as depicted in Fig. 3 [21]. Different groups are connected via optical links based on the OTIS rule, which connects processor i in group j with processor j in group i.
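As a small illustration of the intra-group wiring just described (a sketch under the 1-based row and column indexing used above; the names are ours), the electronic children of a processor inside one \(n \times n\) MOT group can be computed as follows.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the electronic links inside one n x n MOT group, assuming 1-based
 *  row and column indices; the tree roots are at row 1 and column 1. */
public final class MotGroupLinks {

    /** Children of processor (r, c) in the row-wise binary tree of row r. */
    static List<int[]> rowTreeChildren(int r, int c, int n) {
        List<int[]> children = new ArrayList<int[]>();
        if (2 * c <= n)     children.add(new int[]{r, 2 * c});       // column 2C
        if (2 * c + 1 <= n) children.add(new int[]{r, 2 * c + 1});   // column 2C + 1
        return children;
    }

    /** Children of processor (r, c) in the column-wise binary tree of column c. */
    static List<int[]> columnTreeChildren(int r, int c, int n) {
        List<int[]> children = new ArrayList<int[]>();
        if (2 * r <= n)     children.add(new int[]{2 * r, c});       // row 2R
        if (2 * r + 1 <= n) children.add(new int[]{2 * r + 1, c});   // row 2R + 1
        return children;
    }
}
```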

Fig. 3 3×3 OTIS-MOT [21]

3 PRTO algorithm

Heuristic local search algorithms have achieved good results in solving the TSP. They start with a sub-optimal solution and then attempt to improve it until no improvement can be made. One of the well-known local search algorithms is the 2-opt algorithm, which tries to improve a sub-optimal solution obtained from any constructive heuristic algorithm. The improvement starts from a given tour and performs exchanges between cities in this tour in order to find a better solution. In other words, the 2-opt algorithm considers each pair of cities within the tour and exchanges them if, and only if, the new tour is shorter than the old one; this continues until no further improvement can be found, as illustrated in Fig. 4. When the generated tour is optimal in the context of the 2-opt algorithm, the process stops. The time complexity of this algorithm is \(O(N^{2})\), where N is the number of cities.
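The following is a minimal sketch of this sequential 2-opt pass (our own illustrative implementation, not the pseudocode of Fig. 4; a symmetric distance matrix is assumed): it reverses the tour segment between two positions whenever doing so shortens the tour and stops when no improving exchange remains.

```java
/** Minimal sketch of the sequential 2-opt improvement heuristic. */
public final class TwoOpt {

    /** Length of the (cyclic) tour under the distance matrix dist. */
    static double length(int[] tour, double[][] dist) {
        double len = 0.0;
        for (int i = 0; i < tour.length; i++) {
            len += dist[tour[i]][tour[(i + 1) % tour.length]];
        }
        return len;
    }

    /** Repeatedly applies improving 2-opt exchanges to the tour until none remains. */
    static void improve(int[] tour, double[][] dist) {
        int n = tour.length;
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 0; i < n - 1; i++) {
                for (int j = i + 1; j < n; j++) {
                    int a = tour[i], b = tour[i + 1];        // edge (a, b) to be removed
                    int c = tour[j], d = tour[(j + 1) % n];  // edge (c, d) to be removed
                    double delta = dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d];
                    if (delta < -1e-9) {                     // the exchange shortens the tour
                        reverse(tour, i + 1, j);             // 2-opt move: reverse the segment
                        improved = true;
                    }
                }
            }
        }
    }

    /** Reverses tour[from..to] inclusive. */
    static void reverse(int[] tour, int from, int to) {
        while (from < to) {
            int tmp = tour[from]; tour[from] = tour[to]; tour[to] = tmp;
            from++; to--;
        }
    }
}
```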

Fig. 4 The sequential 2-opt algorithm

The quality of the improved solution depends directly on the initial route of a given graph, and even the best initial tour does not guarantee the best improved tour. Thus, to obtain the best improved tour of a given graph, a repetitive 2-opt algorithm must be applied to each tour that can be generated from the graph; the best route is then selected among all the improved routes, with a time complexity of \(O(N^{3})\). This approach guarantees obtaining the best-quality solution in the context of the 2-opt algorithm. However, it requires an iterative process that consumes the available resources and considerable CPU time, particularly for large instances of cities. Therefore, in this section, the repetitive 2-opt heuristic algorithm is selected to be parallelized over the OTIS-HHC and OTIS-MOT optoelectronic architectures. The PRTO algorithm for both optoelectronic architectures is illustrated first, and its analytical evaluation is presented in Section 4.
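Building on the sketch above, the repetitive variant can be outlined as follows (again illustrative only; we assume, as in the paper's pre-processing phase, that the initial tour for each starting city is built with the nearest neighbor heuristic): 2-opt is run once per starting city and the shortest improved tour is kept.

```java
/** Illustrative sketch of the repetitive 2-opt idea: improve one initial tour per
 *  starting city and keep the best result. Relies on the TwoOpt sketch above. */
public final class RepetitiveTwoOpt {

    /** Nearest neighbor tour that starts from the given city. */
    static int[] nearestNeighborTour(int start, double[][] dist) {
        int n = dist.length;
        boolean[] visited = new boolean[n];
        int[] tour = new int[n];
        tour[0] = start;
        visited[start] = true;
        for (int i = 1; i < n; i++) {
            int prev = tour[i - 1], next = -1;
            for (int c = 0; c < n; c++) {
                if (!visited[c] && (next == -1 || dist[prev][c] < dist[prev][next])) next = c;
            }
            tour[i] = next;
            visited[next] = true;
        }
        return tour;
    }

    /** Runs 2-opt once per starting city and returns the shortest improved tour found. */
    static int[] best(double[][] dist) {
        int[] bestTour = null;
        double bestLen = Double.POSITIVE_INFINITY;
        for (int start = 0; start < dist.length; start++) {
            int[] tour = nearestNeighborTour(start, dist);
            TwoOpt.improve(tour, dist);
            double len = TwoOpt.length(tour, dist);
            if (len < bestLen) { bestLen = len; bestTour = tour; }
        }
        return bestTour;
    }
}
```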

3.1 PRTO algorithm on OTIS-HHC

The PRTO algorithm in this paper is designed based on the interesting features of the selected OTIS optoelectronic architectures, such as the iterative structure among groups and the existence of optical links between these groups. The algorithm is composed of four phases, namely: the load balancing phase, the data distribution phase, the local repetitive 2-opt phase, and the data combining phase. The PRTO algorithm on OTIS-HHC, shown in Fig. 5, is illustrated in more detail as follows:

Fig. 5 PRTO algorithm on OTIS-HHC

Phase 1: Load Balancing Phase

Assume that processor \(<0, 0>\) in Group 0 (\(G_{0}\)) is the Main Coordinator (MC) processor, which handles the balancing of the cities among the processors. The MC processor in \(G_{0}\) partitions the N cities among the P processors, such that each processor takes N/P cities. It is important to mention that TSPLIB contains data instances of various sizes; these sizes are not powers of two, so \(N/P\) will generate fractions that can be considered as extra cities. The main goal of this phase is to ensure that the extra cities are distributed among the processors such that at most one extra city is allocated to each of the balanced processors. The cities allocated to each processor are stored in the Allocated Cities Array (ACA).
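As a small illustration of this balancing rule (a sketch under our reading of the phase: every processor receives \(\lfloor N/P\rfloor\) cities and the first N mod P processors receive one extra city; the names are ours), the per-processor allocation counts can be computed as follows.

```java
/** Sketch of the load balancing rule: floor(N/P) cities per processor, with the
 *  N mod P leftover cities spread one each over the first processors. */
public final class LoadBalancing {

    /** Returns how many cities each of the P processors receives. */
    static int[] allocationCounts(int numCities, int numProcessors) {
        int base = numCities / numProcessors;        // cities every processor gets
        int extra = numCities % numProcessors;       // leftover cities to spread out
        int[] counts = new int[numProcessors];
        for (int p = 0; p < numProcessors; p++) {
            counts[p] = base + (p < extra ? 1 : 0);  // at most one extra city each
        }
        return counts;
    }

    public static void main(String[] args) {
        // Class D example from Section 5: 6880 cities over 1152 processors
        int[] counts = allocationCounts(6880, 1152);
        System.out.println(counts[0] + " and " + counts[1151]);  // prints "6 and 5"
    }
}
```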

Phase 2: Data Distribution Phase

The Main Group (MG), which is \(G_{0}\) in OTIS-HHC, contains the MC processor that holds the Distance Matrix (DM). The DM, which represents the distances between the cities in the graph, must be distributed to all other processors in the optoelectronic architecture. Moreover, each group contains a Group Coordinator (GC) processor, which is the processor that connects group \(G_{i}\) with the MG via an optical link.

The data distribution phase is composed of the following steps:

I. Electronic Main Group Distribution (lines 2-6 in Fig. 5): In this step, the MC processor sends a copy of the DM of size \(N^{2}\), where N is the number of cities, to all other processors in the optoelectronic architecture. The whole matrix must be sent to all processors because each processor generates an initial tour using the nearest neighbor algorithm, which is then improved using the selected local search algorithm. In addition, the MC processor sends the ACA to all processors in OTIS-HHC. Consequently, in the intra-HHC distribution, the MC processor initiates a one-to-all broadcast in the one-dimensional HHC of sub-group 0 to send the DM through the electronic links by utilizing the one-bit difference between processor labels. This process is applied in three steps, as illustrated in Fig. 6a: processor 000 sends the DM and ACA to processor 001, then processor 001 sends the DM and ACA to processor 010, and finally the processors in the upper triangle send (in parallel) the DM and ACA to the processors in the lower triangle. Then, in the d-dimensional distribution, all processors in the one-dimensional HHC send (in parallel) both the DM and ACA to their connected processors in the other HHC sub-groups; for example, the processor with label \(<0, 010>\) sends the communicated message to the processor with label \(<1, 010>\), as depicted in Fig. 6b (a conceptual sketch of this one-bit broadcast pattern is given after this list).

II. Optical Distribution of Data (lines 7-8 in Fig. 5): All processors in the MG (in parallel) start sending the DM and ACA through the optical links to their transpose processors in the other groups. This requires one OTIS move. At the end of this step, all GC processors have a copy of the DM and ACA.

III. Inter-Group Distribution of Data (lines 9-10 in Fig. 5): Steps 2-5 in Fig. 5 are repeated by each GC processor in each group of OTIS-HHC. The intra-group distribution and the d-dimensional distribution are accomplished by each GC in parallel to resend the DM and ACA to all other processors in the optoelectronic architecture. Finally, at the end of this phase, P copies of the DM and ACA are distributed to the P processors in the optoelectronic architecture.
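As referenced in step I, the d-dimensional part of the distribution follows a one-bit (hypercube-style) broadcast pattern. The sketch below is a conceptual illustration of ours, not the paper's implementation; it lists, per step, which sub-group labels forward the message by flipping one bit of their label.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch of a one-to-all broadcast over hypercube-style labels: in
 *  step k, every label that already holds the message forwards it to the label
 *  that differs from it in bit k (recursive doubling). */
public final class HypercubeBroadcast {

    /** Returns, for each step, the list of {source, destination} label pairs. */
    static List<List<int[]>> schedule(int dimensions) {
        List<List<int[]>> steps = new ArrayList<List<int[]>>();
        List<Integer> holders = new ArrayList<Integer>();
        holders.add(0);                              // label 0 holds the DM and ACA initially
        for (int bit = 0; bit < dimensions; bit++) {
            List<int[]> sends = new ArrayList<int[]>();
            List<Integer> receivers = new ArrayList<Integer>();
            for (int src : holders) {
                int dst = src ^ (1 << bit);          // flip one bit of the label
                sends.add(new int[]{src, dst});
                receivers.add(dst);
            }
            holders.addAll(receivers);               // senders and receivers now hold the data
            steps.add(sends);
        }
        return steps;                                // one communication step per dimension
    }
}
```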

Fig. 6 Distribution of the Distance Matrix (DM) on HHC

Phase 3: Local Repetitive 2-Opt Phase (lines 11-12 in Fig. 5):

All processors in OTIS-HHC, in parallel, apply the sequential 2-opt algorithm for each city belonging to the set of cities allocated to that processor in the load balancing phase. The algorithm is applied multiple times, each time considering a different city as the starting city, resulting in a different route for each starting city; these routes are stored in a Route Matrix (RM), such that each processor has its own RM.

Phase 4: Data Combining Phase

Since the structure of OTIS-HHC is constructed by combining the structural properties of the HHC and the hypercube, each HHC graph consists of HHC sub-groups that differ in one bit position of their labels. Therefore, the combining phase exploits these features of OTIS-HHC, where three combining levels are adopted based on three processors: the MC processor, the GC processor, and the Sub-Group Coordinator (SGC) processor, where the SGC is the processor with label \(<000>\) in each sub-group. Thus, there are multiple SGCs within the same group of OTIS-HHC. The following are the required steps of the data combining phase:

I. Inter-Group Data Combining (lines 13-18 in Fig. 5): This step combines all the RMs from all the processors in all groups at their associated GC processors via electronic links. In the intra-HHC combining, each processor in the one-dimensional HHC routes its RM to the SGC processor via a gather scheme; this is done by reversing the steps applied in Fig. 6a. Then, in the d-dimensional HHC combining, each SGC processor sends the gathered RMs to its directly connected SGC processor. At the end of this step, all GC processors in OTIS-HHC have the accumulated RMs as a Group Route Matrix (GRM), which will be sent through the optical links. Note that the size of the communicated message grows with the size of the combined matrices.

II. Optical Data Combining (lines 19-20 in Fig. 5): All GC processors in the whole OTIS-HHC, except the GC processor in \(G_{0}\), send their GRMs via optical links to their corresponding processors in the main group \(G_{0}\).

III. Main Group Data Combining (lines 21-22 in Fig. 5): This step repeats the steps in lines 13-18 of Fig. 5, but now within \(G_{0}\), where each processor combines the collected GRM with its own RM and sends it to its directly connected processor using a gather scheme; then, in the d-dimensional HHC combining, all SGC processors send the collected GRMs to their directly connected SGC processors. The process continues until the MC processor has collected all GRMs from all processors in \(G_{0}\). At the end of this communication, the MC processor can find the minimum route among all the collected routes.

3.2 PRTO algorithm on OTIS-MOT

The PRTO algorithm on OTIS-MOT consists of four phases: the load balancing phase, the data distribution phase, the local repetitive 2-opt phase, and the data combining phase. Only the distribution and combining phases represent the communication time in OTIS-HHC and OTIS-MOT, and these phases depend directly on the topological structure of the optoelectronic architectures. Consequently, only the distribution and combining phases are illustrated in more detail in this section.

Phase 2: Data Distribution Phase

The data distribution phase is composed of the following steps:

I. Electronic Main Group Distribution (lines 2-6 in Fig. 7): In the data distribution phase, the MC processor sends a copy of the DM and ACA to all other processors in OTIS-MOT. The distribution is carried out using a one-to-all broadcast in two phases. The first is the row-wise phase, where the MC processor broadcasts to all other processors in its row tree. The second is the column-wise phase, where a one-to-all broadcast is initiated by each processor that received the DM in the earlier phase; this broadcast sends the DM to all processors in the same column tree as the sender processor through electronic links. Each phase takes \(\log n\) steps; thus, the whole electronic main group distribution needs \(2\log n\) communication steps (a worked count of the overall distribution steps is given after this list). Now, all the nodes that received the DM and ACA in the MG start the optical distribution of data, as shown in the next step.

II. Optical Distribution of Data (lines 7-8 in Fig. 7): All nodes in \(G_{0}\) (in parallel) start sending the DM and ACA through the optical links to their transpose processors in the other groups. This requires one OTIS optical move.

III. Inter-Group Distribution of Data (lines 9-10 in Fig. 7): Steps 2-5 in Fig. 7 are repeated by each GC processor in each group of OTIS-MOT. The row-wise and column-wise distributions are accomplished by each GC in parallel to resend the DM and ACA to all other processors in the optoelectronic architecture.
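For reference (our own arithmetic, assuming each tree phase takes \(\lceil \log n\rceil\) steps when n is not a power of two), the three steps above give a total of

$$ Steps_{dist}= 2\lceil \log n\rceil + 1 + 2\lceil \log n\rceil = 4\lceil \log n\rceil + 1 $$

communication steps for the distribution phase, which evaluates to 5 steps for n = 2 and 9 steps for n = 4, in line with the class A and class C values reported later in Table 7.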

Phase 4: Data Combining Phase

The data combining phase is performed by reversing the order of the steps of the distribution phase, as follows:

I. Inter-Group Data Combining (lines 13-16 in Fig. 7): This step combines all the Route Matrices (RMs) from all the processors in all groups at their associated GC processors via electronic links. In the column-wise combining, each processor located in the last row of a group routes its RM toward the root of its column tree via a gather scheme. Then, the processors in the row tree of the GC processor perform a gather scheme to send the gathered RMs toward the GC. At the end of this step, all GC processors in OTIS-MOT have the accumulated RMs as a Group Route Matrix (GRM), which will be sent through the optical links.

II. Optical Data Combining (lines 17-18 in Fig. 7): All GC processors in the whole OTIS-MOT, except the GC processor in \(G_{0}\), send their GRMs via optical links to their corresponding processors in the main group \(G_{0}\).

III. Main Group Data Combining (lines 19-20 in Fig. 7): This step repeats the steps in lines 13-16, but now in group \(G_{0}\), where each processor combines the collected GRM with its own RM and sends it to its directly connected processor using a gather scheme. The process continues until the MC processor has collected all GRMs from all processors in \(G_{0}\).

Fig. 7 PRTO algorithm on OTIS-MOT

4 Analytical evaluation

This section provides the analytical evaluation of the PRTO algorithm on OTIS-HHC and OTIS-MOT optoelectronic architectures. The following performance metrics are used to evaluate the algorithm: parallel time complexity, speedup, efficiency, cost, and communication cost.

4.1 Analytical evaluation of PRTO algorithm on OTIS-HHC

In this section, the PRTO algorithm on OTIS-HHC is evaluated analytically in terms of parallel time complexity, speedup, efficiency, cost, and communication cost.

4.1.1 Parallel time complexity

The parallel run time consists of the computation time plus the communication time. The parallel time complexity of the PRTO algorithm is given in Theorem 1.

Theorem 1

The worst-case parallel time complexity of the PRTO algorithm on OTIS-HHC is shown in (1), where T is the parallel time complexity, N is the number of cities, P is the number of processors, and dh is the dimension of the HHC.

$$ T (N, P)= {\Theta} \left( P+ \frac{N^{3}}{P}+N^{2}\times dh\right) \quad \textup{when } G=P/2 \textup{ and } G=P $$
(1)

Proof

The PRTO algorithm on OTIS-HHC (Fig. 5) is evaluated in terms of parallel run time complexity for all phases as demonstrated in Table 1a–d.

Table 1 The parallel run time complexity for all phases of the PRTO algorithm on OTIS-HHC architecture

The overall parallel run time complexity of phases 1–4 is shown in (2) for \(G=P/2\) and in (3) for \(G=P\). Thus, (2) and (3) can be reduced to (4).

$$\begin{array}{@{}rcl@{}} T(N, P) &=& {\Theta} (P)+ {\Theta} (N^{2} \times dh)+ {\Theta} \left(\frac{N}{P}\times N^{2}\right)\\ &&+{\Theta} \left(\frac{N^{2}}{P}\times (P_{G}\times dh^{2}+P_{G} \times dh)\right) \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} T(N, P) &=& {\Theta} (P)+ {\Theta} (N^{2}\times dh)+ {\Theta}(\frac{N}{P}\times N^{2})\\ &&+{\Theta} (\frac{N^{2}}{P}\!\times\! (13\!\times\! P_{G} + 6\!\times\! 2^{dh-1}\!\times\! P_{G})) \end{array} $$
(3)
$$ T(N, P) \approx {\Theta}(P+ \frac{N^{3}}{P}+N^{2} \times dh) $$
(4)

4.1.2 Speedup

The speedup is the improvement in speed between the implementation of a given problem on a single processor and the implementation of the same problem on a parallel machine [33]. The speedup of the PRTO algorithm on OTIS-HHC is shown in (5).

$$ S= \frac{N^{3}\times P}{P^{2}+N^{3}+N^{2}\times dh \times P} $$
(5)
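As an intermediate step (added here for clarity), (5) follows from dividing the sequential run time of the repetitive 2-opt algorithm, \(\Theta(N^{3})\), by the parallel time in (4) and then multiplying the numerator and denominator by P:

$$ S=\frac{N^{3}}{P+\frac{N^{3}}{P}+N^{2}\times dh}=\frac{N^{3}\times P}{P^{2}+N^{3}+N^{2}\times dh\times P} $$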

4.1.3 Efficiency

Efficiency is represented as the ratio between the speedup and the number of processors in a parallel architecture [33]. Therefore, the efficiency of the PRTO algorithm on OTIS-HHC is shown in (6).

$$ E=\frac{N^{3}}{P^{2}+N^{3}+N^{2}\times dh\times P} $$
(6)

4.1.4 Cost

The cost is the total time needed by all the processors in a parallel architecture to solve a problem. It can be obtained by multiplying the total number of processors in the parallel architecture by the parallel time, i.e., \(Cost=P \times T_{P}\) [33]. \(T_{P}\) is the parallel run time of the PRTO algorithm, which is presented in (4), and P is the number of processors in the optoelectronic architecture. So, the cost of the PRTO algorithm on OTIS-HHC for \(G=P\) and \(G=P/2\) is shown in (7).

$$ Cost=P^{2}+N^{3}+N^{2}\times dh\times P $$
(7)

4.1.5 Communication cost

The communication cost represents the total number of links in a parallel architecture that are used in solving the task [33]. In our case, it is the total number of electronic links and optical links used to apply the PRTO algorithm on OTIS-HHC. Since the communication of the PRTO algorithm is represented by the distribution and combining phases, the communication cost of this algorithm depends directly on the communication pattern adopted in these two phases. As shown in Table 2, the electronic main group distribution in OTIS-HHC requires \(5 + 6\times (2^{dh-1}-1)\) electronic links. This is obtained by adding the number of links used in the intra-HHC distribution and the number of links used in the d-dimensional HHC distribution. In the former case, the number of utilized links equals 5, while in the latter case, the number of links is found by counting the links needed by the hypercube at each of the dh steps, when \(dh\ge 2\), which equals \(\sum_{i=0}^{dh-2} 6\times 2^{i}\).

Table 2 Communication cost of PRTO algorithm on OTIS-HHC and OTIS-MOT

The optical distribution of data requires \(\frac{6\times 2^{dh-1}}{2}-1\) optical links, which is equal to the number of groups inside OTIS-HHC minus one; the number of groups here corresponds to the case \(G=P/2\) [20]. For the case \(G=P\), the number of required optical links equals \(6\times 2^{dh-1}-1\), again the number of groups minus one [20]. Now, in the inter-group distribution of data, the number of utilized electronic links can be obtained by multiplying the number of groups by the number of electronic links utilized during the distribution inside one group, as depicted in Table 2. Note that the communication cost of the combining phase is the same as the communication cost of the distribution phase. So, the total communication cost of applying the PRTO algorithm on OTIS-HHC is given in (8) and (9).

$$ {Comm}_{Cost}= 2\times (18\times 2^{2dh-2}-1) \text{ for } G=P/2 $$
(8)
$$ {Comm}_{Cost}= 2\times (36\times 2^{dh-1}-1) \text{ for } G=P $$
(9)
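As a worked check of (8) (our own arithmetic), consider a three-dimensional OTIS-HHC with \(G=P/2\), which has 12 groups of 24 processors: the electronic distribution inside each group uses \(5+6\times(2^{2}-1)=23\) links, giving \(12\times 23=276\) electronic links in total, and the optical distribution uses \(12-1=11\) links, i.e., 287 links per phase. Doubling for the combining phase gives

$$ {Comm}_{Cost}= 2\times(276+11)= 2\times (18\times 2^{4}-1)=574, $$

in agreement with (8).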

4.2 Analytical evaluation of PRTO algorithm on OTIS-MOT

In this section, the PRTO algorithm on OTIS-MOT is evaluated analytically in terms of parallel time complexity, speedup, efficiency, cost, and communication cost.

4.2.1 Parallel time complexity

The parallel time complexity of the PRTO algorithm is captured in Theorem 2.

Theorem 2

The worst-case parallel time complexity of the PRTO algorithm on OTIS-MOT is shown in (10), where T is the parallel time complexity, N is the number of cities, P is the number of processors, and n is the number of processors in each row of the MOT within one group.

$$ T (N, P) = {\Theta} (P + \frac{N^{3}}{P}+N^{2}\times 2\log n) $$
(10)

Proof

The parallel run time complexity for all phases of PRTO algorithm on OTIS-MOT is evaluated analytically by tracing the algorithm in Fig. 7 line by line, as demonstrated in Table 3a–d.

Table 3 The parallel run time complexity for all phases of the PRTO algorithm on OTIS-MOT architecture

The overall parallel run time complexity of phases 1–4 is shown in (11).

$$\begin{array}{@{}rcl@{}} T\!(N\!,\! P) \!&=&\! {\Theta} (P)+ {\Theta} (N^{2}\times 2\log n)\\ &&+ {\Theta}(\frac{N}{P}\!\times\!\! N^{2}) + {\Theta} (\frac{n \!\times\!(\log n)^{2} \!\times\! N^{2} \times P_{G}}{P}) \end{array} $$
(11)

Equation (11) can be reduced to (12).

$$ T(N, P) \approx {\Theta} (P+ \frac{N^{3}}{P}+N^{2}\times 2\log n) $$
(12)

4.2.2 Speedup

The speedup of the PRTO algorithm on OTIS-MOT is shown in (13).

$$ S= \frac{N^{3}\times P}{P^{2}+N^{3}+N^{2}\times 2\log n \times P} $$
(13)

4.2.3 Efficiency

The efficiency of the PRTO algorithm on OTIS-MOT is shown in (14).

$$ E=\frac{N^{3}}{P^{2}+N^{3}+N^{2}\times 2\log n\times P} $$
(14)

4.2.4 Cost

The cost of PRTO algorithm on OTIS-MOT is shown in (15).

$$ Cost=P^{2}+N^{3}+N^{2}\times 2\log n \times P $$
(15)

4.2.5 Communication cost

As depicted in Table 2, the electronic main group distribution in OTIS-MOT requires \((\sqrt{P_{G}}-1)+(P_{G}-\sqrt{P_{G}})\) electronic links. In OTIS-MOT, the one-to-all broadcast is performed in two phases, the row-wise phase and the column-wise phase. For instance, consider an OTIS-MOT with 256 processors, where each group contains 16 processors. In the row-wise phase, \(\sqrt{P_{G}}-1\) electronic links are required to distribute the DM to all processors in the same row as the MC processor. In the column-wise phase, 12 electronic links are utilized to send the received DM to all processors in parallel; this is calculated by multiplying the number of processors in one row of a group by that number minus one. The optical distribution of data requires \(P_{G}-1\) optical links, which is equal to the number of groups inside OTIS-MOT minus one.

Now, in the inter-group distribution of data, the number of utilized electronic links equals \(P_{G}^{2}-2P_{G}+1\), obtained by multiplying the number of groups minus one by the number of electronic links utilized during the distribution inside one group. Therefore, the total communication cost of applying the PRTO algorithm on OTIS-MOT is given in (16).

$$ {Comm}_{Cost}= 2(P_{G}^{2}-1) $$
(16)
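Summing the three contributions for the example above with \(P_{G}=16\) (our own arithmetic): the electronic main group distribution uses \((\sqrt{P_{G}}-1)+(P_{G}-\sqrt{P_{G}})=15\) links, the optical distribution uses \(P_{G}-1=15\) links, and the inter-group distribution uses \(P_{G}^{2}-2P_{G}+1=225\) links, i.e., \(P_{G}^{2}-1=255\) links per phase. Doubling for the combining phase recovers (16):

$$ {Comm}_{Cost}= 2\times(15+15+225)= 2\times(P_{G}^{2}-1)= 510 $$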

The analytical evaluation of the PRTO algorithm on OTIS-HHC and OTIS-MOT is summarized in Table 4, where \(T_{D}\) denotes the time needed for the distribution phase, \(T_{C}\) denotes the time needed for the combining phase, \(T_{P}\) denotes the parallel run time, S denotes the speedup, E denotes the efficiency, C denotes the cost, and CC denotes the communication cost.

Table 4 Parallel run time complexity, speedup, efficiency, cost, and communication cost of PRTO algorithm on OTIS-HHC and OTIS-MOT

5 Simulation results

In this section, we present and discuss the simulation results acquired from applying the PRTO algorithm on the OTIS-HHC and OTIS-MOT optoelectronic architectures.

5.1 Simulation setup

To assess the performance of the PRTO algorithm on the OTIS-HHC and OTIS-MOT optoelectronic architectures, an object-oriented Java-based simulation was implemented using Java JDK 8 under the Eclipse environment. All simulation experiments were run on an Intel(R) Core(TM) i7 3.2 GHz processor with 16 GB RAM, a 4 MB cache memory, and Windows 8.1 64-bit as the operating system. To carry out the simulation, we use a startup time equal to 55 µs [34], an electronic link speed of 250 Mb/s, and an optical link speed of 2.5 Tb/s [35]. The average run time is calculated after repeating each run case ten times.
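To make the simulated timing parameters concrete, the following sketch computes the time of a single message transfer as the startup time plus the transmission time (message size divided by link bandwidth). The constants are the values quoted above; the per-message model itself and the 32-bit entry size are our illustrative assumptions, not the authors' simulator code.

```java
/** Illustrative per-message communication time model: time = startup + bits / bandwidth. */
public final class LinkTimeModel {

    static final double STARTUP_SECONDS       = 55e-6;    // startup time of 55 microseconds [34]
    static final double ELECTRONIC_BITS_PER_S = 250e6;    // electronic link speed of 250 Mb/s
    static final double OPTICAL_BITS_PER_S    = 2.5e12;   // optical link speed of 2.5 Tb/s [35]

    /** Time in seconds to push a message of the given size over one link. */
    static double transferTime(long messageBits, double bitsPerSecond) {
        return STARTUP_SECONDS + messageBits / bitsPerSecond;
    }

    public static void main(String[] args) {
        // e.g., a 6880 x 6880 distance matrix stored with 32 bits per entry (assumed size)
        long bits = 6880L * 6880L * 32L;
        System.out.printf("electronic: %.3f s, optical: %.6f s%n",
                transferTime(bits, ELECTRONIC_BITS_PER_S),
                transferTime(bits, OPTICAL_BITS_PER_S));
    }
}
```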

The following classes have been developed:

  • DM class for reading different data instances of the TSPLIB [13].

  • Repetitive 2-opt class for performing the sequential 2-opt algorithm.

  • Load balancing class for balancing N cities among P processors.

  • OTIS-HHC class for constructing the endorsed dimensions of OTIS-HHC.

  • OTIS-MOT class for constructing the endorsed dimensions of OTIS-MOT.

The size ranges for the OTIS-HHC, OTIS-MOT, MOT, and HHC architectures are shown in Table 5. These ranges include network sizes varying from 16 to 2304 processors, and each range is named with a class, as illustrated in the table. For instance, OTIS-HHC (G = P/2) of class B has 72 processors, while OTIS-MOT of the same class has 81 processors. These ranges were chosen based on the topological structure of each architecture, where each dimension has a number of processors obtained using the equations illustrated in Table 6, in which dh is the dimension of the HHC and n is the number of processors in each row of the MOT. To carry out the simulation runs, several data instances of the TSPLIB were chosen to test the algorithm performance. The selected instances have sizes varying from 29 cities to 6880 cities.

Table 5 Architectures size ranges
Table 6 Size equations of various architectures

5.2 Comparative evaluation

In this section, we compare and discuss in detail the simulation results of the PRTO algorithm for solving the TSP on the OTIS-HHC and OTIS-MOT architectures using various numbers of processors and various TSP data instances. Also, we compare the performance of the PRTO algorithm with the performance of the PRNN algorithm in terms of speedup and efficiency over the OTIS-HHC and OTIS-MOT architectures. Additionally, we present a realistic comparison, in terms of speedup and efficiency, between the PRTO algorithm on the OTIS-HHC (G = P/2) architecture and the MMAS algorithm on the Sunway Blue Light supercomputer.

The time needed by each processor to perform its task is the computation time; in our case, it is the time in seconds to find the routes using the 2-opt algorithm for each city belonging to the \(N/P\) cities assigned to that processor.

As shown in Fig. 8, as the number of processors increases along the adopted classes (i.e., the network size ranges), the computation time decreases. To illustrate, applying the PRTO algorithm on the OTIS-HHC (G = P/2) optoelectronic architecture in class A, which is the size range of 16-36 processors, recorded less computation time than applying it on the OTIS-MOT optoelectronic architecture; this is due to the 18 processors of OTIS-HHC (G = P/2) involved in the computation, compared to the 16 processors of OTIS-MOT. In class B, which is the size range of 72-144 processors, applying the PRTO algorithm on OTIS-MOT took less computation time than applying it on OTIS-HHC (G = P/2), due to the 81 processors of OTIS-MOT versus the 72 processors of OTIS-HHC (G = P/2); that is, OTIS-MOT has 9 more processors than OTIS-HHC (G = P/2). In class C, which is the size range of 256-576 processors, applying the PRTO algorithm on OTIS-HHC (G = P/2) recorded slightly less computation time than applying it on OTIS-MOT, due to the 288 processors of OTIS-HHC (G = P/2) versus the 256 processors of OTIS-MOT. Correspondingly, in class D, which is the size range of 1152-2304 processors, applying the PRTO algorithm on OTIS-MOT and on OTIS-HHC (G = P/2) recorded the same computation time; this relates to the load balancing phase, which allocates \(N/P\) cities to some processors and (N/P)+1 cities to the others. Consequently, dividing 6880 cities among 1152 processors in the case of OTIS-HHC (G = P/2) and among 1296 processors in the case of OTIS-MOT results in the allocation of 5 cities to some processors and 6 cities to others in both optoelectronic architectures.

Fig. 8 Computation time of applying the PRTO algorithm on OTIS-HHC and OTIS-MOT for the xsc6880 TSP data instance

Thus, we observed that in all classes the optoelectronic architecture with the higher number of processors achieved less computation time than the one with the lower number of processors. This is due to the fact that keeping the problem size constant while increasing the number of processors assigns a smaller portion of the data to each processor from one class to the next. Consequently, the task is applied to smaller data, resulting in less computation time. As a result, the number of processors is the dominant factor in the computation time.

As is evident from Fig. 8, OTIS-HHC in the case of \(G=P\) is superior in terms of computation time in all the adopted classes. This is due to the larger number of processors offered by this case of OTIS-HHC. For example, in class A, OTIS-HHC (G = P) provides 36 processors, while OTIS-HHC (G = P/2) and OTIS-MOT provide 18 and 16 processors, respectively.

Figure 9 demonstrates the communication time of applying the PRTO algorithm on the OTIS-HHC and OTIS-MOT optoelectronic architectures. In general, the communication time increases as the number of processors in each optoelectronic architecture increases. In classes A and C, applying the PRTO algorithm on OTIS-HHC (G = P/2) incurred higher communication time than applying it on OTIS-MOT, since in class A OTIS-MOT recorded 5 communication steps in the distribution phase while OTIS-HHC recorded 7, and in class C OTIS-MOT recorded 9 communication steps in the distribution phase while OTIS-HHC recorded 11, as illustrated in Table 7.

Fig. 9 Communication time of applying the PRTO algorithm on OTIS-HHC and OTIS-MOT for the xsc6880 TSP data instance

Table 7 Communication steps of the optoelectronic architectures

A careful examination of the behavior of the communication time between class B and class C for OTIS-MOT reveals that it decreases slightly, even though the number of communication steps is equal to 9 in both classes, as shown in Table 7. The reason is that, with the communication steps fixed, the increase in the number of processors between these two classes causes this decrease: as the number of processors increases, the communicated message size decreases, and thus the combining operation time decreases too. In classes B and D, OTIS-HHC (G = P/2) and OTIS-MOT recorded almost the same communication time, since they have comparable numbers of communication steps, equal to 9 in class B and 13 in class D, as shown in Table 7. Regarding OTIS-HHC in both cases, \(G=P/2\) and \(G=P\), as shown in Fig. 9, they recorded nearly the same communication time. This is because OTIS-HHC with \(G=P\) has a higher number of processors but the same diameter as \(G=P/2\), and hence the same number of communication steps, as shown in Table 7.

However, it was not sufficient to conclude which architecture offers better communication time based on only those four classes. Thus, we performed additional simulation runs on OTIS-MOT and OTIS-HHC (G = P/2) with a higher number of processors. The extra simulation runs used 4608 processors of OTIS-HHC and 4096 processors of OTIS-MOT with a data instance of 6880 cities. The results favored OTIS-HHC (G = P/2), which recorded 21.5 seconds of communication time while OTIS-MOT recorded 25 seconds. The extra simulation runs were repeated with a larger data instance of 19289 cities, where OTIS-HHC (G = P/2) recorded 170.7 seconds of communication time while OTIS-MOT recorded 199.2 seconds. A larger difference in communication time was noticed with the larger number of processors for these two optoelectronic architectures. Consequently, OTIS-HHC (G = P/2) earned this distinction since it recorded less communication time despite using 512 more processors than OTIS-MOT.

Broadly speaking, the diameter and the number of processors in each group of OTIS-HHC and OTIS-MOT play an important role in the number of communication steps required for the distribution and combining phases. This becomes particularly clear at larger dimensions of OTIS-HHC (G = P/2), which requires 17, 19, and 21 communication steps in its 6th, 7th, and 8th dimensions, respectively, whereas, for the same size ranges, OTIS-MOT requires 23, 35, and 45 communication steps. We can conclude that OTIS-HHC (G = P/2) provides less communication time than OTIS-MOT, since the topological structure of OTIS-HHC utilizes the hypercube in its factor network, while OTIS-MOT utilizes the mesh in its factor network. Thus, OTIS-HHC features a lower diameter than OTIS-MOT, resulting in less communication time.

The parallel time includes the computation time and the communication time, so it is the time of the whole parallel process. A careful look at the results shown in Fig. 10 reveals that in class A OTIS-HHC (G = P/2) provides less parallel time than OTIS-MOT. This is because OTIS-HHC (G = P/2) achieved less computation time, which compensates for the increase in communication time, so the parallel time was lower. In contrast, OTIS-MOT in class B obtained better parallel time than OTIS-HHC (G = P/2), while in classes C and D OTIS-HHC (G = P/2) recorded slightly better parallel time. A conclusion can be drawn regarding the parallel time of OTIS-HHC (G = P/2) based on the discussion of the communication time results: as previously illustrated, less communication time was recorded by OTIS-HHC (G = P/2) at higher dimensions, and the parallel time improves as the communication time decreases. Thus, applying the PRTO algorithm on OTIS-HHC (G = P/2) recorded, in general, a smaller parallel time than on OTIS-MOT.

Fig. 10 Parallel time of the PRTO algorithm on OTIS-HHC and OTIS-MOT for the xsc6880 TSP instance

As is evident from Fig. 10, OTIS-HHC (G = P) recorded the lowest parallel time and is prominent in all classes by a large margin. For example, in class C, OTIS-HHC (G = P) achieved 76 seconds of parallel time while OTIS-HHC (G = P/2) and OTIS-MOT achieved 131 and 132 seconds, respectively. Consequently, OTIS-HHC (G = P) offers less computation time without adding excessive communication. Therefore, this structure suits applications that require intensive computation, since it can provide a high number of processors with good communication time.

Speedup results are demonstrated in Fig. 11. In general, the speedup in this figure grows as the number of processors increases along the selected size ranges. Applying the PRTO algorithm on OTIS-HHC (G = P/2) achieved high speedups equal to 16.7, 66.2, and 231.5 in classes A, B, and C, respectively. Also, applying the PRTO algorithm on OTIS-MOT achieved high speedups equal to 15, 75.7, and 229.8 in classes A, B, and C, respectively.

Fig. 11 Speedup of the PRTO algorithm on OTIS-HHC and OTIS-MOT for the xsc6880 TSP instance

Before starting the discussion on each class separately, it is worthwhile to emphasize the OTIS-HHC (G = P) speedup results. As demonstrated in Fig. 11, OTIS-HHC (G = P) achieved the best speedup among all the selected architectures by a large margin. This is due to the aforementioned reasons, namely its superiority in providing a high number of processors with less parallel run time, so OTIS-HHC (G = P) deserves this excellence.

Now, let us focus the discussion on each class separately, according to Fig. 11. Beginning with class A, one can observe that the speedup of applying the PRTO algorithm on OTIS-HHC (G = P/2) is better, by a small margin, than on OTIS-MOT, since OTIS-HHC (G = P/2) in this class has 18 processors while OTIS-MOT has 16. Thus, partitioning the cities among 18 processors yields fewer cities per processor than partitioning them among 16 processors. This directly influences the computation time of each processor, which is lower in the case of OTIS-HHC (G = P/2) than OTIS-MOT in this class. So, the whole parallel time is reduced, since the decrease in computation time compensates for the increase in communication time. In class B, OTIS-MOT achieved better speedup, also by a small margin, since it has less parallel time than OTIS-HHC (G = P/2) in this class. It is obvious from the figure that OTIS-HHC (G = P/2) obtained slightly better speedup than OTIS-MOT in classes C and D. Again, it was not sufficient to conclude which optoelectronic architecture offers better speedup based on only these four classes. Thus, the speedup of the previous additional simulation runs of 6880 cities, on OTIS-MOT using 4096 processors and on OTIS-HHC (G = P/2) using 4608 processors, was calculated. The results show that OTIS-HHC (G = P/2) recorded a speedup of 1016, while OTIS-MOT recorded a speedup of 906. In light of this, and as concluded previously from the parallel time results, where OTIS-HHC (G = P/2) achieved better parallel time at higher dimensions, we can say that applying the PRTO algorithm on OTIS-HHC recorded, in general, better speedup than on OTIS-MOT.

The speedup for class D for the selected TSP data instances of sizes 1304, 1389, 1817, 2036, 2175, 3038, 3891, 4355, 5934, and 6880 cities is shown in Fig. 12. In general, when the number of processors is fixed, the speedup increases as the data size increases, as shown in Fig. 12. This relates to the fact that the communication time and the computation time increase as the data size increases, and hence so does the parallel time; on the other hand, the sequential time required to solve the problem increases even faster as the data size increases. Therefore, the speedup, which is the ratio between the sequential time and the parallel time, increases.

Fig. 12 Speedup of applying the PRTO algorithm on OTIS-HHC and OTIS-MOT for class D with various TSP data instances

Figure 13 illustrates the advantage of the selected optoelectronic architectures over their factor networks. The network sizes were considered based on the sizes offered by each factor network. As shown in Fig. 13, the number of processors in each architecture plays an important role in this result. For example, in HHC, dividing 6880 cities among 768 processors assigns 9 cities to the balanced processors, while dividing 6880 cities among the 1152 processors of OTIS-HHC with \(G=P/2\) assigns 6 cities to the balanced processors. This lowers the computation time, particularly with the 2-opt algorithm, where even one extra city can change the computation time considerably.

Fig. 13 Speedup of applying the PRTO algorithm on OTIS-HHC (G = P), OTIS-HHC (G = P/2), OTIS-MOT, HHC, and MOT architectures for the xsc6880 data instance

In addition to the computation time, the communication time also plays an important role in this result. For instance, OTIS-HHC with \(G=P\) recorded the highest speedup compared to OTIS-HHC with \(G=P/2\), HHC, MOT, and OTIS-MOT. This is because OTIS-HHC with \(G=P\) can offer a higher number of processors with the same diameter as \(G=P/2\), and hence the same number of communication steps. The same applies to OTIS-MOT, as illustrated previously in the discussion of the communication time results. Consequently, OTIS-HHC earned this distinction since it recorded less communication time with the higher number of processors.

In general, OTIS-MOT recorded better speedup than its factor network (MOT), and OTIS-HHC recorded better speedup than its factor network (HHC), in all classes of size ranges. This is due to the processor grouping feature of the OTIS optoelectronic architecture, with its two types of links: the optical links that connect groups of processors and the electronic links that connect processors within each group. As a result, this feature can lower the design complexity of the interconnection network by applying an iterative structure among processors, such that a higher dimension of the factor network can be simulated using a lower dimension of the OTIS architecture of that factor network.

The efficiency of applying the PRTO algorithm on OTIS-HHC and OTIS-MOT is depicted in Fig. 14. Generally, the efficiency in this figure decreases as the number of processors increases; this is due to the degradation of the ratio between the speedup and the number of processors as both increase. A careful examination of class A in this figure shows the achievement of excellent efficiency, which approaches 0.92 for the PRTO algorithm. In classes A and B, OTIS-HHC (G = P/2) and OTIS-MOT achieved comparable efficiency, while in class C, OTIS-MOT achieved better efficiency than OTIS-HHC (G = P/2). This follows from the speedup results, where both optoelectronic architectures recorded almost the same speedup in this class, as shown in Fig. 11, so the number of processors accounts for the difference in efficiency between the two architectures: in OTIS-HHC (G = P/2), dividing a speedup of 230 by 288 processors results in less efficiency than in OTIS-MOT with 256 processors. Regarding class D, OTIS-HHC (G = P/2) achieved better efficiency for the same reason, since OTIS-HHC (G = P/2) has 1152 processors and OTIS-MOT has 1296 processors. Therefore, the number of processors in the optoelectronic architecture plays an important role in the efficiency results. Regarding OTIS-HHC (G = P), it achieved the lowest efficiency in all the adopted classes. This is because OTIS-HHC (G = P) offers a high number of processors in each class; thus, dividing the obtained speedup by a high number of processors yields the lowest efficiency.

Fig. 14 Efficiency of the PRTO algorithm on OTIS-HHC and OTIS-MOT for the xsc6880 TSP instance

Figure 15 illustrates the cost of the PRTO algorithm on OTIS-HHC and OTIS-MOT, where the cost is the product of the number of processors in the optoelectronic architecture and the parallel time required to perform PRTO on that architecture. As shown in the figure, in class A, OTIS-HHC recorded a slightly higher cost than OTIS-MOT, even though OTIS-HHC has less parallel time in this class than OTIS-MOT. The reason is the higher number of processors of OTIS-HHC in class A, and similarly in class B; in class C, the cost of OTIS-HHC is much higher than the cost of OTIS-MOT, since the number of processors in OTIS-HHC is much higher than in OTIS-MOT. Regarding class D, OTIS-HHC \((G=P/2)\) recorded less cost than OTIS-MOT, since both optoelectronic architectures recorded almost the same parallel time, but OTIS-HHC has 1152 processors while OTIS-MOT has 1296 processors in this class. Therefore, the difference in the number of processors between these optoelectronic architectures contributes to this variation in the results. In general, if the number of processors in one optoelectronic architecture is higher than in another, then its cost will be higher too.

Fig. 15 Cost of the PRTO algorithm on OTIS-HHC and OTIS-MOT for the xsc6880 TSP instance

Table 8 illustrates the behavior of the communication cost as the number of processors increases. It is obvious from the table that as the network size increases, the communication cost increases too. The table clearly shows that OTIS-HHC \((G=P)\) utilized the largest number of optical and electronic links, since it has the largest number of processors compared to OTIS-HHC \((G=P/2)\) and OTIS-MOT. OTIS-MOT recorded less communication cost than OTIS-HHC \((G=P/2)\) in classes A and C, that is, in the network size ranges 16-36 and 256-576, respectively. This relates to the fact that OTIS-HHC requires fewer communication steps, and hence less communication time, utilizing a smaller number of electronic and optical links during the communication phases. Moreover, in classes A and C, the number of processors of OTIS-MOT is less than the number of processors of OTIS-HHC \((G=P/2)\); therefore, the number of electronic and optical links is smaller. In classes B and D, OTIS-HHC \((G=P/2)\) recorded less communication cost than OTIS-MOT, for the same reasons stated previously.

Table 8 Communication cost of PRTO algorithm on OTIS-HHC and OTIS-MOT architectures

The PRTO algorithm starts with initial tours in order to find a TSP solution. These initial tours are constructed using the nearest neighbor algorithm in a pre-processing phase, which requires a pre-processing computation time to complete. Thus, in order to show the effect of this pre-processing computation time on the parallel time, speedup, and efficiency, we ran the PRTO algorithm to solve the TSP on the OTIS-HHC \((G=P)\) architecture for various sizes, that is, for various numbers of processors, using the xsc6880 TSP data instance. Table 9 shows the pre-processing computation time, the PRTO computation time, the total computation time (the pre-processing computation time plus the PRTO computation time), the PRTO communication time, and the parallel time with and without the pre-processing computation time. In general, the parallel time is the computation time plus the communication time. The difference in the parallel time with and without the pre-processing computation time is small, between 2.26% and 3.51% for the various OTIS-HHC sizes. Additionally, Table 10 shows the effect of including or excluding the pre-processing computation time on speedup and efficiency. Including the pre-processing computation time in the parallel time slightly lowers the speedup and efficiency, due to the increase in the computation time.

Table 9 Parallel time with pre-processing time on OTIS-HHC \((G=P)\) architecture using xsc6880 TSP data instance
Table 10 Speedup and efficiency with pre-processing time on OTIS-HHC \((G=P)\) architecture using xsc6880 TSP data instance
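The following minimal sketch illustrates how the quantities in Tables 9 and 10 relate: the parallel time is the computation time plus the communication time, optionally including the pre-processing (nearest-neighbor) time, from which speedup and efficiency follow. All timing values below are hypothetical placeholders, not measured results.

```python
def parallel_time(t_comp: float, t_comm: float, t_pre: float = 0.0) -> float:
    """Parallel time = (optional pre-processing) + computation + communication."""
    return t_pre + t_comp + t_comm


def speedup(t_seq: float, t_par: float) -> float:
    return t_seq / t_par


def efficiency(t_seq: float, t_par: float, processors: int) -> float:
    return speedup(t_seq, t_par) / processors


# Hypothetical values (seconds): sequential time, NN pre-processing time,
# PRTO computation time, communication time, and the processor count.
t_seq, t_pre, t_comp, t_comm, p = 1000.0, 0.8, 25.0, 3.0, 36

t_without = parallel_time(t_comp, t_comm)        # excludes pre-processing
t_with = parallel_time(t_comp, t_comm, t_pre)    # includes pre-processing
print(speedup(t_seq, t_without), efficiency(t_seq, t_without, p))
print(speedup(t_seq, t_with), efficiency(t_seq, t_with, p))  # slightly lower
```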

In the context of optimization cost, the PRTO algorithm depends on the existence of initial solutions (initial tours) that can be improved to obtain better ones. Thus, each processor generates a number of initial solutions based on the cities allocated to it. Afterwards, the generated solutions are improved using the PRTO algorithm. Table 11 shows the time needed to generate the initial solutions, denoted \(Time_{gen}\), which is the pre-processing computation time mentioned earlier, and the time needed to improve the solutions using the PRTO algorithm, denoted \(Time_{imp}\), which is the parallel time of the PRTO algorithm excluding the pre-processing computation time. Table 11 also shows the optimization cost of the PRTO algorithm running on OTIS-HHC \((G=P/2)\) with 1152 processors (class D) using various data instances for solving the TSP. The optimization cost is the difference between the time it took to improve the solutions using the PRTO algorithm and the time it took to generate the initial solutions before any improvement, as shown in (17).

$$ Optimization_{Cost} = Time_{imp} - Time_{gen} $$
(17)

As the number of cities increases, the optimization cost increases as well, due to the increase in the time needed by the PRTO algorithm to improve the solutions, as shown in Table 11.

Table 11 Optimization cost results of PRTO algorithm
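A minimal sketch of the optimization-cost computation in (17), mirroring the structure of Table 11; the instance names appear in the text, but the timing values are hypothetical placeholders.

```python
def optimization_cost(time_imp: float, time_gen: float) -> float:
    """Optimization cost = improvement time (PRTO) - generation time (NN)."""
    return time_imp - time_gen


# Hypothetical (Time_gen, Time_imp) pairs in seconds for two TSPLIB instances.
runs = {"rl5934": (0.35, 6.40), "xsc6880": (0.45, 9.80)}
for instance, (t_gen, t_imp) in runs.items():
    print(instance, optimization_cost(t_imp, t_gen))
```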

The combining operation could be performed as a reduce operation rather than a gather operation, which would improve the obtained simulation results, particularly the speedup. The reason for adopting a gather operation in our data combining phase is to enable the MC processor to reuse the routes generated by the nearest neighbor algorithm as initial tours for any other heuristic or metaheuristic algorithm for solving the TSP. Another reason is that, if a reduce operation replaces the gather operation, the behavior of the combining phase becomes similar to that of the distribution phase, in which it is difficult to explain the various factors that affect the performance of the combining phase [32]. Nevertheless, an additional simulation run was performed in order to quantify the improvement of a reduce operation over a gather operation. The results show an enhancement in speedup of about 9.3% in the case of 2304 processors on OTIS-HHC \((G=P)\), when applying the PRTO algorithm to the xsc6880 TSP data instance.
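The following architecture-agnostic sketch contrasts the two combining strategies: a gather keeps every improved sub-tour at the MC processor (so the nearest-neighbor routes remain available for reuse), whereas a reduce keeps only the best tour at each step. The tour data are hypothetical placeholders, and the sketch abstracts away the actual OTIS communication steps.

```python
from functools import reduce


def tour_length(result):
    # Placeholder: in the real algorithm this is the sum of edge weights.
    return result["length"]


# Improved sub-tours held by the processors before the combining phase
# (hypothetical data).
local_results = [
    {"owner": 0, "length": 712.4, "route": [0, 3, 1, 2]},
    {"owner": 1, "length": 698.1, "route": [1, 0, 2, 3]},
    {"owner": 2, "length": 705.9, "route": [2, 1, 3, 0]},
]

# Gather-style combining: the MC processor keeps every route and can reuse
# the nearest-neighbor tours later as initial tours for other algorithms.
gathered = list(local_results)
best_after_gather = min(gathered, key=tour_length)

# Reduce-style combining: only the shorter tour survives each pairwise step,
# so intermediate routes are discarded along the way.
best_after_reduce = reduce(
    lambda a, b: a if tour_length(a) <= tour_length(b) else b, local_results
)

assert best_after_gather == best_after_reduce
```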

Solution quality can be measured using the percentage deviation [36] as illustrated in (18).

$$ \text{Solution Quality} = \frac{\text{Found Solution} - \text{Optimal Solution}}{\text{Optimal Solution}} \times 100\% $$
(18)
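A minimal sketch of the percentage deviation in (18) and of how the best, average, and worst deviations reported in Table 12 can be aggregated; the tour lengths are hypothetical.

```python
def percentage_deviation(found: float, optimal: float) -> float:
    """Percentage deviation of a found tour length from the optimal length."""
    return (found - optimal) / optimal * 100.0


optimal = 21657                          # hypothetical optimal tour length
found_lengths = [23100, 22480, 24950]    # hypothetical lengths from repeated runs

deviations = [percentage_deviation(f, optimal) for f in found_lengths]
best_dev = min(deviations)                     # B_S delta%
avg_dev = sum(deviations) / len(deviations)    # A_S delta%
worst_dev = max(deviations)                    # W_S delta%
print(best_dev, avg_dev, worst_dev)
```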

Applying the 2-opt algorithm repetitively, each time starting from a different city rather than from a single random city, while exploiting the parallelism provided by the selected optoelectronic architectures, yields solutions of better quality than the sequential version of the 2-opt algorithm. Table 12 shows the solution quality for the selected TSP data instances, where \(\mathrm {O}_{\mathrm {S}}\) denotes the optimal solution, \(\mathrm {B}_{\mathrm {S}}\) the best solution, \(\mathrm {A}_{\mathrm {S}}\) the average solution, \(\mathrm {W}_{\mathrm {S}}\) the worst solution, and \({\Delta }\) the percentage deviation from the optimal solution.

Table 12 Obtained quality solutions
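The following sketch illustrates the repetitive strategy described above: a nearest-neighbor tour is built from every starting city and improved with 2-opt, and the best result is kept. It follows the classical sequential 2-opt formulation on a small hypothetical instance, not the paper's parallel implementation.

```python
def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def nearest_neighbor(start, dist):
    """Constructive heuristic: always visit the closest unvisited city next."""
    n = len(dist)
    unvisited, tour = set(range(n)) - {start}, [start]
    while unvisited:
        nxt = min(unvisited, key=lambda c: dist[tour[-1]][c])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def two_opt(tour, dist):
    """Improvement heuristic: reverse segments while the tour keeps improving."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, dist) < tour_length(tour, dist):
                    tour, improved = candidate, True
    return tour


# Hypothetical 5-city symmetric distance matrix.
dist = [[0, 2, 9, 10, 7], [2, 0, 6, 4, 3], [9, 6, 0, 8, 5],
        [10, 4, 8, 0, 6], [7, 3, 5, 6, 0]]

# Repetitive 2-opt: one restart per starting city (done in parallel by PRTO).
best = min((two_opt(nearest_neighbor(s, dist), dist) for s in range(len(dist))),
           key=lambda t: tour_length(t, dist))
print(best, tour_length(best, dist))
```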

Applying the 2-opt algorithm from a single random initial city gives an average solution quality within 16.5% of the optimal one, as illustrated by the \(\mathrm {W}_{\mathrm {S}}{\Delta }\%\) column in Table 12. On the other hand, applying PRTO gives an average solution quality within 7.3% of the optimal solution, as shown in the \(\mathrm {B}_{\mathrm {S}}{\Delta }\%\) column. More importantly, the PRTO algorithm achieved solutions within 4.7% of the optimal solution for small data instances, such as 29, 130, and 400 cities. This relates to the fact that the solution quality of heuristic algorithms tends to degrade as the data size increases [37]. It is important to mention that the solution quality was influenced neither by the type of optoelectronic architecture nor by the number of processors, but only by the data size; therefore, the obtained solutions were similar on OTIS-HHC (G = P and \(G=P/2\)) and OTIS-MOT for all network size ranges.

In order to show the superiority of the PRTO algorithm over other recent parallel algorithms for solving the TSP, we performed additional simulation runs of the PRNN algorithm, which was parallelized and applied on OTIS-Hypercube and OTIS-Mesh in [32]. To the best of our knowledge, the PRNN algorithm is, to date, the only parallel heuristic algorithm that has been used for solving the TSP on optoelectronic architectures. Thus, we applied the PRNN algorithm on OTIS-HHC (G = P and \(G=P/2\)) and OTIS-MOT, so that a fair comparison between the performance of PRNN and that of PRTO on these two optoelectronic architectures could be made. The comparative evaluation is based on the speedup and efficiency for the same size ranges and the xsc6880 data instance, as shown in Table 13.

Table 13 Speedup and efficiency results of PRNN and PRTO over OTIS-HHC and OTIS-MOT

Table 13 shows the speedup and efficiency results of applying the PRNN and PRTO algorithms to solve the TSP over the OTIS-HHC and OTIS-MOT optoelectronic architectures using the xsc6880 data instance. The table highlights the superiority of the PRTO algorithm in recording higher speedup and efficiency compared to the PRNN algorithm. For example, in the case of OTIS-HHC with \((G=P)\) for the size range between 1152 and 2304 processors, the speedup of PRTO is 928.9 and the speedup of PRNN is 27; that is, PRTO is faster than PRNN by approximately 34 times. For the same case, the efficiency of PRTO is 40%, whereas the efficiency of PRNN is 1%. This is an outcome of PRTO's intensive computation, which raises the ratio of computation time to communication time. In contrast, the PRNN algorithm gained lower speedup and efficiency, indicating that it spent more time in communication than in computation, particularly for large network sizes, where the number of cities allocated to each processor shrinks as the network size increases.

Moreover, the solution quality is compared for both the PRNN and PRTO algorithms on various data instances from TSPLIB, as shown in Table 14. As demonstrated in this table, the best, average, and worst percentage errors of the PRTO algorithm are lower than those of the PRNN algorithm. For instance, for the rl5934 data instance the error of the PRTO algorithm equals 7%, while the error of the PRNN algorithm is 18.2%; that is, the PRTO algorithm provides a solution quality 2.6 times better than the PRNN algorithm. Thus, Table 14 shows the superiority of the PRTO algorithm over the PRNN algorithm in achieving higher solution quality. As mentioned earlier, the solution quality is influenced neither by the type of optoelectronic architecture nor by the number of processors, but only by the data size; therefore, the obtained solutions were similar on OTIS-HHC (G = P and \(G=P/2\)) and OTIS-MOT for all network size ranges.

Table 14 PRNN and PRTO solution quality results over various data instances

An additional realistic comparison was performed in terms of speedup and efficiency, where the algorithm selected for comparison is presented in [38]. This algorithm solves the TSP using an ant colony heuristic approach; it is a parallel version of MMAS, implemented on the Sunway Blue Light supercomputer using the Message Passing Interface (MPI). In order to make the comparison fair, the OTIS-HHC (with \(G=P/2\)) optoelectronic architecture is selected, since its number of processors in the different size ranges is approximately equal to the number of cores used in [38].

Table 15 shows the speedup and efficiency results of both the PRTO and MMAS algorithms, where P represents the number of processors in OTIS-HHC (with \(G=P/2\)) and C represents the number of cores in the Sunway Blue Light supercomputer. The data instances eil51, pr1002, and fnl4461 from TSPLIB are selected to evaluate the performance of the PRTO and MMAS algorithms. As shown in Table 15, PRTO recorded a better speedup than MMAS for the large data instance. In particular, using the fnl4461 data instance, which contains 4461 cities, the speedup of the PRTO algorithm is 560.6 and that of the MMAS algorithm is 103; that is, the PRTO algorithm on OTIS-HHC achieved a 5.44 times better speedup than the MMAS algorithm on the Sunway Blue Light supercomputer. In contrast, using eil51, a small data instance containing only 51 cities, the MMAS algorithm achieved a speedup 1.64 times better than that of the PRTO algorithm, and using the pr1002 data instance, which contains 1002 cities, the speedup of the PRTO algorithm is slightly better than that of the MMAS algorithm. Thus, as the number of cities and the number of processors increase, the speedup of the PRTO algorithm increases as well.

Table 15 PRTO and MMAS speedup and efficiency results

The efficiency results depend on the number of processors or cores and on the achieved speedup. Therefore, the PRTO algorithm on the OTIS-HHC (with \(G=P/2\)) architecture achieved a better efficiency result than the MMAS algorithm on the Sunway Blue Light supercomputer using the fnl4461 data instance, while it achieved a lower efficiency using the eil51 data instance. However, using the pr1002 data instance, both PRTO and MMAS achieved almost the same efficiency, as shown in Table 15.

6 Conclusions

In summary, this paper presents a parallel version of the 2-opt algorithm for solving the TSP on two selected OTIS optoelectronic architectures, namely OTIS-HHC and OTIS-MOT. Four classes of network size ranges, between 16 and 2304 processors, are introduced to perform a comparative evaluation of these optoelectronic architectures. Moreover, four phases for applying the parallel algorithm on the optoelectronic architectures of interest are presented. Each phase is analytically evaluated and simulated on each optoelectronic architecture. The performance of the PRTO algorithm is analytically evaluated in terms of parallel time complexity, speedup, efficiency, cost, and communication cost. Additionally, the performance of the PRTO algorithm is experimentally evaluated by simulation in terms of computation time, communication steps, communication time, parallel time, speedup, efficiency, cost, communication cost, optimization cost, and solution quality. The simulation runs were carried out on TSPLIB data instances of various sizes, and the results achieved a high speedup approaching 32.96 on OTIS-HHC (with \(G=P\)) using 36 processors, with efficiency approaching 92%. Regarding the PRTO algorithm running on the OTIS-HHC architecture with 1152 processors, the optimization cost increases as the number of cities increases across data instances of various sizes, due to the increasing time required by the PRTO algorithm to improve the solutions.

A detailed discussion of the performance evaluation of OTIS-HHC and OTIS-MOT using the selected performance metrics is provided. The results showed that OTIS-HHC is better than OTIS-MOT in terms of parallel running time, communication time, communication cost, and speedup, particularly when the network size becomes larger with a higher number of processors. The differences in the number of processors between OTIS-HHC and OTIS-MOT contribute to the computation time, efficiency, and cost results. As demonstrated in the results discussion, OTIS-HHC \((G=P)\) achieved the best speedup among OTIS-HHC \((G=P/2)\) and OTIS-MOT by a large margin, due to its superiority in providing a high number of processors with a lower parallel run time.

A comparative study between the PRTO and PRNN algorithms was conducted, showing the superiority of PRTO over PRNN in terms of speedup, efficiency, and solution quality. For example, in the case of OTIS-HHC with \((G=P)\) for the size range between 1152 and 2304 processors, the speedup of PRTO is 928.9 and the speedup of PRNN is 27, showing that PRTO is faster than PRNN by approximately 34 times. For the same case, the efficiency of PRTO is 40%, whereas the efficiency of PRNN is 1%. In addition, as a solution quality metric, the resulting error of PRTO for the rl5934 data instance equals 7%, while it is 18.2% for PRNN.

An additional realistic comparison was performed in terms of speedup and efficiency between the PRTO algorithm on the OTIS-HHC (with \(G=P/2\)) optoelectronic architecture and the MMAS algorithm on the Sunway Blue Light supercomputer. The results showed the superiority of PRTO in terms of speedup and efficiency, particularly for large data instances.

In this paper, we exploited the interesting features of OTIS optoelectronic architectures in solving the TSP using a parallel local search algorithm, achieving high speedup and efficiency. We hope this encourages researchers to exploit these optoelectronic architectures in solving other problems.