1 Introduction

Global datacenter (DC) traffic has been growing at a compound annual growth rate of 27% from 2016 to 2021, and 77% of this traffic resides within DCs [1]. Data-parallel-computing applications (DPCAs), such as big data, web search, and artificial intelligence, are driving this traffic growth [2, 3]. Most DPCAs are organized into multiple stages or have machines grouped by functionality. For example, MapReduce consists of a series of successive computation and communication stages. As the scale of MapReduce increases, more data-parallel clusters participate in each computation stage. Before the next computation stage can start, several parallel flows must finish transmitting between two groups of machines within a communication stage. The collection of flows in such a communication stage is called a "Coflow" [4], which bears special characteristics; e.g., the Coflow completion time (CCT) is determined by the completion time of the last-finished flow in the collection. Decreasing the average CCT leads to faster completion of the corresponding jobs of DPCAs [4]. However, traditional scheduling algorithms that optimize the completion of an individual flow cannot reduce the overall latency of a Coflow because of the dependency among all parallel flows in a Coflow. Therefore, Coflows require new scheduling schemes that address these special characteristics.

The Coflow scheduling and placement problems have recently attracted research attention, and several prior works exist in this direction. Many studies have investigated how to reduce CCT by optimizing Coflow scheduling, e.g., [5,6,7]. Baraat, the algorithm proposed in [5], uses local observations as indicators of CCT, resulting in poor performance for large-scale parallel clusters. Varys [6] requires information about all flows and Coflows, introducing high overhead for small Coflows. Aalo [7] applies an online approximate Shortest-Job-First algorithm, which leads to unfairness for large Coflows.

On the other hand, most recent works do not consider Coflow placement optimization; i.e., the locations of the parallel flows in a Coflow are either preset or chosen randomly at senders and receivers ("Random Placement"). But the average CCT is jointly determined by both Coflow placement and scheduling. So far, few works have proposed Coflow placement algorithms. The authors of [8] propose a 2D-placement algorithm for Coflows, which considers the size of Coflows and works independently under any scheduling algorithm. The placement algorithm "Neat" proposed in [9] uses the states of all flows of all Coflows and information about their traffic priority under a specific network scheduling policy, which requires considerable storage and computation resources as the DC scales. Because these Coflow placement approaches are unaware of the underlying switching architecture and of which scheduling algorithms are adopted in practice, optimizing Coflow placement alone may not yield a significant reduction in CCT.

Traditional DC architectures are based on electrical switching. This poses (and will soon pose even more dramatically) a set of challenges to modern DC operators, especially those running DPCAs, in terms of fast exhaustion of transmission bandwidth, high power consumption, high latency, and low scalability. For example, considering that today's DCs need to support bandwidths on the order of Tbps, the power consumption of a single switch can reach 2400 W [10]. Moreover, power consumption scales with the size of the DC network, leading to higher energy costs. In addition, DPCAs have diverse computing requirements: some are throughput-sensitive while others are delay-sensitive. For large-scale data-parallel computing, DPCAs expect a broad spectrum of services from the DC network. Thus, optical networking is a promising solution for modern DCs to support these frameworks over large-scale data-parallel clusters.

Wavelength-division multiplexing (WDM), long deployed in backbone optical networks, can provide multiple concurrent channels to avoid head-of-line blocking, achieving lower latency and lower packet loss than traditional electrical switches. However, WDM may not provide the large number of fine-grained channels needed to satisfy large-scale DC traffic requirements. Optical packet switching (OPS) has been proposed as a promising technique for intra-datacenter networks thanks to energy-efficient optical components. For example, Arrayed Waveguide Grating Routers (AWGRs) can work with fast-tuning lasers to significantly reduce the power consumption of packet-switching interconnects for DCs. Several OPS architectures have been proposed in [11,12,13,14], based on re-arrangeable, non-blocking, multistage models such as Benes or Banyan. However, when the interconnection pattern of the intra-datacenter network requires reconfiguration, the optical packet switch incurs a reconfiguration time during which service interruption can occur, possibly giving rise to packet loss. The time complexity of reconfiguration in these architectures is \(O\left( N\log_{2} N \right)\), where \(N\) is the number of input and output ports [12].

To reduce reconfiguration time, a simple and scalable OPS architecture, called Packet-Switched Optical Network (PSON), has recently been proposed in [15]. PSON is based on a one-stage core switching architecture, which reduces scheduling complexity. Furthermore, in PSON, the data plane and control plane are decoupled, so the control plane performs scheduling independently and keeps the data plane "always-on" (i.e., transmitting data without interruption). PSON also uses space switches to connect multiple AWGRs, making intra-datacenter interconnects easy to scale. In addition, the decoupling of data and control planes eases the implementation of advanced scheduling algorithms that can be customized for different applications. Hence, PSON is a suitable DC network architecture for DPCAs.

Going beyond the contributions of the aforementioned prior works [5,6,7,8,9], in this study we argue that the CCT minimization problem can be further studied from two other aspects. First, Coflow scheduling and placement algorithms should consider the specific structure of the underlying switching architecture interconnecting the data-parallel clusters, in our case the PSON. Second, the scheduling and placement algorithms should cooperate to jointly optimize the CCT of a Coflow. We therefore propose a PSON-aware Coflow placement and scheduling (PACPS) algorithm, which consists of two stages. First, we adopt a priority-aware scheduling algorithm based on flow-level characteristics of Coflows (such as the length of a Coflow, the number of flows in a Coflow, the total size of flows in a Coflow, etc.) and on the features of PSON (e.g., the size of a queue, the number of flows in a queue, the delay of the first flow in a queue, the space-switching time, etc.). Second, we design a Coflow placement algorithm that cooperates with the scheduling algorithm to reduce the average CCT.

This study is organized as follows. In Sect. 2, we introduce the Packet-Switched Optical Network datacenter architecture and Coflow characteristics. In Sect. 3, we describe the two-stage PSON-aware Coflow placement and scheduling algorithm. In Sect. 4, we quantitatively investigate the performance of the PACPS algorithm and compare it with state-of-the-art algorithms through simulation. Section 5 concludes this study.

2 Coflow characteristics and optical datacenter architecture

2.1 Coflow Characteristics

A Coflow is a collection of independent flows that share a common performance goal. Coflow performance is measured by the Coflow completion time (CCT), i.e., the time required to finish transmitting all flows of a Coflow [6].

To design efficient placement and scheduling algorithms for Coflows, we need to understand their characteristics. Since a Coflow consists of multiple parallel flows, it cannot be characterized by its size alone. Thus, the authors of [6] defined the "length" of a Coflow as the size of its largest flow in bytes, and its "width" as the number of parallel flows. In the length dimension, a Coflow is "short" if its length is ≤ 1 MB and "long" otherwise. In the width dimension, a Coflow with fewer than 50 flows is "narrow"; otherwise, it is "wide". Hence, based on length and width, Coflows can be classified into 4 bins: short and narrow, long and narrow, short and wide, and long and wide. The authors analyzed data traces from a Facebook warehouse cluster [16] with 3000 machines and 150 racks, observing that more than 40% of Coflows are short, meaning that only some Coflows have flows that are very large. Similarly, more than 60% of Coflows are narrow, while only a small number of Coflows consist of millions of flows. By studying the ratio of the number of senders to receivers for each Coflow, they also found senders to be bottlenecks for almost a third of Coflows, which confirms the importance of Coflow placement.
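To make the classification concrete, the following minimal sketch applies the length/width thresholds above. The function name is illustrative (not from [6]), and we take 1 MB = 10^6 bytes as an assumption.

```python
# Sketch of the Coflow length/width classification from [6].
# Names are illustrative; thresholds follow the text (1 MB, 50 flows).

def classify_coflow(flow_sizes_bytes):
    """Return one of the 4 bins based on a Coflow's parallel flow sizes."""
    length = max(flow_sizes_bytes)   # "length" = largest flow (bytes)
    width = len(flow_sizes_bytes)    # "width"  = number of parallel flows
    length_bin = "short" if length <= 1_000_000 else "long"  # 1 MB threshold
    width_bin = "narrow" if width < 50 else "wide"           # 50-flow threshold
    return f"{length_bin} and {width_bin}"

# Example: three 0.5-MB flows form a short and narrow Coflow.
print(classify_coflow([500_000, 500_000, 500_000]))  # "short and narrow"
```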

2.2 Packet-switched optical network

To design a more efficient Coflow placement and scheduling algorithm, knowledge of the underlying DC network architecture is also necessary. In this work, we consider the Packet-Switched Optical Network (PSON) as a promising architecture for intra-DC networks thanks to its excellent scalability, low latency, and energy efficiency [15].

In PSON, the data and control planes are decoupled from the actual optical devices and logically centralized. The data plane takes charge of packet switching, while the control plane manages the scheduling of photonic-frame switching in each time slot.

As shown in Fig. 1a, the core of the PSON architecture is composed of an array of AWGRs, which are passive, high-speed optical devices. An AWGR can achieve contention resolution in the wavelength domain and reduce energy consumption [15]. In PSON, a top-of-rack (ToR) switch connects to multiple servers, and \(n\) such ToR switches are interconnected by the AWGR-based core switching network. Each ToR switch contains an ingress module and an egress module. An ingress module consists of \(n\) virtual output queues (VoQs), each corresponding to another ToR switch as the destination. A \(1 \times m\) space switch connects an ingress module with \(m\) AWGRs, and each AWGR exclusively delivers photonic frames to \(n/m\) egress modules. The scalability of PSON is enhanced by using multiple \(1 \times m\) space switches, so the number of switch ports can be enlarged without considerably increasing latency. Figure 1b illustrates an example of applying \(1 \times 2\) space switches to increase the number of supported ToR switches from 40 to 80. Since ToR switches do not need to exchange signaling packets with each other, sophisticated scheduling decisions can be made in a timely manner to reduce switching latency, and precious switching bandwidth can be reserved for serving customers' data.

Fig. 1 a PSON architecture. b PSON data plane (with optical switch fabric). c Photonic frame format
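For illustration, the following schematic Python model captures the ingress-side structure described above: each ToR switch keeps one VoQ per destination ToR, and a \(1 \times m\) space switch selects among \(m\) AWGRs, each serving \(n/m\) egress modules. The class, its fields, and the contiguous AWGR-to-egress mapping are assumptions for exposition, not a specification of PSON.

```python
from collections import deque

# Schematic sketch of a PSON ingress module (assumed structure):
# n VoQs (one per destination ToR) plus a 1 x m space switch.

class IngressModule:
    def __init__(self, n_tors: int, m_awgrs: int):
        assert n_tors % m_awgrs == 0, "each AWGR serves n/m egress modules"
        self.n, self.m = n_tors, m_awgrs
        self.voqs = [deque() for _ in range(n_tors)]  # VoQ_ij per egress j
        self.switch_position = 0                      # current AWGR selection

    def enqueue(self, dst_tor: int, frame) -> None:
        self.voqs[dst_tor].append(frame)

    def awgr_for(self, dst_tor: int) -> int:
        # Assumed contiguous mapping: AWGR k serves egress modules
        # [k*n/m, (k+1)*n/m).
        return dst_tor // (self.n // self.m)

    def needs_tuning(self, dst_tor: int) -> bool:
        # True if serving this VoQ requires retuning the 1 x m space
        # switch; this is the information used later by the scheduler.
        return self.awgr_for(dst_tor) != self.switch_position
```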

3 Coflow-aware placement and scheduling algorithm design

3.1 Scheduling algorithm

Even with knowledge of the underlying DC architecture in hand, existing Coflow scheduling algorithms such as Baraat [5], Varys [6], and Aalo [7] are not suitable for PSON, because they neither account for the characteristics of the underlying switching architecture nor cooperate with a correlated Coflow placement algorithm. Likewise, the 2D-placement algorithm considers only the total size of each Coflow while ignoring other network performance measurements.

Fortunately, the Priority-Aware (PA) scheduling algorithm proposed for the PSON architecture in [15] overcomes the above weaknesses. The PA algorithm uses the priorities of VoQs to reflect and handle different flow-level characteristics of Coflows. The priority in PA is obtained by combining four strategies, namely longest queue first (LQF), largest number of flows first (LNFF), oldest flow first (OFF), and less space-switch tuning first (LSSTF), weighted as shown in Eq. (1).

$$W_{ij} = l_{ij} \cdot w_{l} + n_{ij} \cdot w_{n} + d_{ij} \cdot w_{d} + s_{ij} \cdot w_{s} \quad (1)$$

Here \(W_{ij}\) is the priority weight calculated for \(VoQ_{ij}\) (the virtual output queue at ingress node \(i\) associated with egress node \(j\)), \(l_{ij}\) is the length of \(VoQ_{ij}\), \(n_{ij}\) is the number of flows in \(VoQ_{ij}\), \(d_{ij}\) is the delay of the earliest flow in \(VoQ_{ij}\), and \(s_{ij}\) is a Boolean value (0 if sending a frame of \(VoQ_{ij}\) requires tuning the space switch; 1 otherwise). \(w_{l}\), \(w_{n}\), \(w_{d}\), and \(w_{s}\) are the corresponding weighting factors, normalized to the [0, 1] range based on their associated impact. Note that the entire scheduling process of the PA algorithm is executed in a separate centralized controller rather than at each port. With the VoQs' priority information, the centralized controller decides which nodes should be granted permission (if possible) to transmit traffic in the next time slot. In this way, the PA algorithm takes multiple network measurements into account to provide better network performance.
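As a concrete illustration, the sketch below computes Eq. (1) for a single VoQ. The equal default weighting factors are an assumption for demonstration only, and the inputs are assumed to be pre-normalized to [0, 1].

```python
# Sketch of the PA priority weight of Eq. (1). Default weighting factors
# (0.25 each) are illustrative assumptions; l_ij, n_ij, d_ij are assumed
# already normalized to [0, 1].

def pa_priority(l_ij, n_ij, d_ij, needs_tuning,
                w_l=0.25, w_n=0.25, w_d=0.25, w_s=0.25):
    s_ij = 0 if needs_tuning else 1  # LSSTF: prefer VoQs needing no tuning
    return l_ij * w_l + n_ij * w_n + d_ij * w_d + s_ij * w_s

# Per time slot, the centralized controller would grant transmission to
# the highest-weight VoQs whose ingress and egress ports are still free.
w = pa_priority(l_ij=0.8, n_ij=0.3, d_ij=0.5, needs_tuning=False)
print(w)  # 0.65
```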

3.2 Min-priority placement algorithm design

We design a Coflow placement algorithm, called Min-Priority (MP), that uses the same 'priority' concept as the PA scheduling algorithm, so that the placement algorithm can cooperate with the flow scheduling algorithm to jointly reduce CCT.

First, we present an example to provide some insight into the Coflow placement problem. To simplify the demonstration, in this example we temporarily use flow size as the only weight factor deciding the "priority" of each node. We consider a non-blocking, two-way, fully connected network containing three physical DC nodes, represented by a \(3 \times 3\) matrix (shown in the top right corner of Fig. 2a).

Fig. 2 Coflow placement motivation example

Since a new Coflow \(C_{n}\) consists of multiple parallel flows, we describe each parallel flow by a tuple (\(s_{i}\), \(r_{j}\), \(d_{i,j}\)), where \(s_{i}\) is logical sender \(i\), \(r_{j}\) is logical receiver \(j\), and \(d_{i,j}\) is the traffic demand from logical sender \(i\) to receiver \(j\). The new Coflow \(C_{n}\) in our example contains only three flows, involving two logical senders and two logical receivers in total, and a flow's logical sender cannot share a physical node with its receiver; the Coflow locality constraints are therefore represented by a \(2 \times 2\) red matrix \(D\) (shown in the bottom left corner of Fig. 2a), where the number in each block is \(d_{i,j}\). All parallel flows in one row of \(D\) must be transmitted by the same logical sender, and all parallel flows in one column of \(D\) must be sent to the same receiver. Figure 2a shows the current network state when the new Coflow \(C_{n}\) arrives; yellow blocks represent the size of existing flows on each physical node. In Fig. 2b–g, we list all possible placements of the new Coflow's \(s_{1}\), \(s_{2}\), \(r_{1}\), and \(r_{2}\), and give the CCT for each placement. Here, gradient-color blocks indicate nodes that will carry both existing Coflows and the new Coflow \(C_{n}\) after placement. The bottleneck logical sender of \(C_{n}\) is \(s_{1}\), which must send flows totaling 20 units, while the bottleneck receiver is \(r_{2}\), which must receive flows totaling 25 units. We observe that, to minimize CCT, we must avoid placing the new Coflow's bottleneck logical sender or receiver (endpoint) onto a physical node with insufficient outbound (or inbound) bandwidth, because such bottleneck endpoints are more likely to delay the CCT than non-bottleneck endpoints. For example, in Fig. 2, we should place \(C_{n}\)'s bottleneck logical receiver \(r_{2}\) on the physical egress node \(eg_{1}\), which has sufficient bandwidth, to obtain the minimum CCT (Fig. 2g). Because existing Coflows already claim some bandwidth on \(eg_{2}\) and \(eg_{3}\), placing the bottleneck receiver \(r_{2}\) on either of these busier nodes (Fig. 2b–e) would delay \(C_{n}\)'s bottleneck and extend the CCT.

Algorithm 1 Min-Priority (MP) Coflow placement algorithm (pseudocode)

In contrast, placing the non-bottleneck receiver \(r_{1}\) on a busier node \(eg_{2}\) is less harmful, as shown in Fig. 2g, because the non-bottleneck receiver \(r_{1}\) can tolerate a certain amount of delay as long as its delay does not exceed that of the bottleneck receiver \(r_{2}\) and does not increase the CCT.

We also note that a traditional flow-level placement algorithm may be suboptimal for Coflows. For example, the authors of [9] place a Coflow's constituent flows sequentially in decreasing order of flow size, because "large flows are more likely to be critical to determine CCT" [9]. However, such a strategy can yield suboptimal solutions, as shown in Fig. 2d, f. Under this flow-level strategy, \(C_{n}\)'s largest flow, \(f_{2,2} = 15\), is deemed most critical and may occupy the empty ingress node \(ing_{3}\), as shown in Fig. 2d, f. This placement leaves the suboptimal ports of \(ing_{1}\) or \(ing_{2}\) for the bottleneck logical sender \(s_{1}\) of \(C_{n}\), leading to a higher CCT compared with Fig. 2g.
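The bottleneck endpoints in this example can be read directly off the demand matrix \(D\) as row and column sums, as in the following sketch. The individual demand values are assumed so that the totals match the text (\(s_{1}\) sends 20 in total, \(r_{2}\) receives 25 in total, and \(f_{2,2} = 15\)).

```python
# Sketch: identify a Coflow's bottleneck sender/receiver from its demand
# matrix D (rows = logical senders, columns = logical receivers). The
# individual entries are assumed values consistent with the example.

D = [[10, 10],   # s1 -> r1, s1 -> r2 (s1 sends 20 in total)
     [ 0, 15]]   # s2 -> r2 is the largest flow f_{2,2} = 15

send_totals = [sum(row) for row in D]          # [20, 15]
recv_totals = [sum(col) for col in zip(*D)]    # [10, 25]

bottleneck_sender = send_totals.index(max(send_totals))    # 0, i.e., s1
bottleneck_receiver = recv_totals.index(max(recv_totals))  # 1, i.e., r2
print(bottleneck_sender, bottleneck_receiver)              # 0 1
```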

On the other hand, since CCT is jointly determined by both Coflow placement and scheduling, taking the scheduling algorithm into account when designing Coflow placement allows their cooperation to reduce CCT. As the PA scheduling algorithm in Sect. 3.1 combines four weight factors to represent the priority of a node, we design a Min-Priority (MP) Coflow placement algorithm that not only uses the priority weight from PA but also, following the motivation example in Fig. 2, places the bottleneck logical endpoint(s) of a Coflow at the DC node(s) with minimum priority weight. Reference [6] found logical senders to be bottlenecks for almost a third of Coflows by studying the ratio of the number of senders to receivers for each Coflow. We therefore first assign physical nodes to logical senders and then to logical receivers.

To enable cooperation between the proposed MP placement algorithm and the PA scheduling algorithm, we use the priority of each node to characterize Coflow contention, as explained below. In Algorithm 1, \(e_{i}^{ing}\) represents the existing priority of ingress node \(i\), defined as the sum of the priorities of all VoQs at ingress node \(i\) (\(e_{i}^{ing} = \sum_{j = 1}^{N} W_{ij}\)), where \(W_{ij}\) comes from Eq. (1). Similarly, \(e_{j}^{eg}\) represents the priority of egress node \(j\), defined as the sum of the priorities of the VoQs sending flows to egress node \(j\) from the different ingress nodes (\(e_{j}^{eg} = \sum_{i = 1}^{N} W_{ij}\)). We use vectors \(E^{ing}\left[ \cdot \right]\) and \(E^{eg}\left[ \cdot \right]\) to denote the sets of \(e_{i}^{ing}\) and \(e_{j}^{eg}\), respectively, and vectors \(L^{s}\left[ {s_{1}, \ldots, s_{N}} \right]\) and \(L^{r}\left[ {r_{1}, \ldots, r_{M}} \right]\) to denote the weights to be added to ingress (egress) nodes by the new Coflow demand requested on each logical sender (receiver) of \(C_{n}\) (Lines 2 and 3 in Algorithm 1). The asymptotic complexity of the MP Coflow placement algorithm is \(O\left( M \right)\), where \(M\) is the number of flows in a Coflow.

Our proposed MP placement algorithm consists of two steps (see Algorithm 1; a sketch follows below). When a new Coflow arrives, MP first iteratively updates the to-be-added priorities \(L^{s}\left[ {s_{i}} \right]\) and \(L^{r}\left[ {r_{j}} \right]\) of each logical sender and receiver (Lines 2 and 3) by adding \(d_{i,j} \cdot w_{l} + 1 \cdot w_{n}\) per flow, where \(w_{l}\) and \(w_{n}\) are the size and number-of-flows weighting factors from Eq. (1). Second, MP considers each logical sender (or receiver) in descending order of its to-be-added priority and places it onto the ingress (or egress) node that is least likely to delay the new Coflow's bottleneck.
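The following Python sketch illustrates these two steps under stated assumptions: the priorities \(E^{ing}\)/\(E^{eg}\) are precomputed from Eq. (1), each logical endpoint occupies a distinct physical node (as in Fig. 2), and ties are broken arbitrarily. It illustrates Algorithm 1's logic; it is not its verbatim pseudocode.

```python
# Sketch of the two-step MP placement. Assumptions: E_ing/E_eg hold the
# current per-node priority sums from Eq. (1); each logical endpoint is
# mapped to a distinct physical node; ties break arbitrarily.

def mp_place(D, E_ing, E_eg, w_l, w_n):
    """D[i][j]: demand from logical sender i to logical receiver j.
    Returns (sender -> ingress node, receiver -> egress node) maps."""
    n_s, n_r = len(D), len(D[0])
    # Step 1: to-be-added priority per endpoint (d_ij*w_l + 1*w_n per flow).
    L_s = [sum(D[i][j] * w_l + w_n for j in range(n_r) if D[i][j] > 0)
           for i in range(n_s)]
    L_r = [sum(D[i][j] * w_l + w_n for i in range(n_s) if D[i][j] > 0)
           for j in range(n_r)]

    def assign(L, E):
        # Step 2: heaviest endpoint first onto the minimum-priority node.
        free, mapping = set(range(len(E))), {}
        for ep in sorted(range(len(L)), key=lambda k: -L[k]):
            node = min(free, key=lambda n: E[n])
            mapping[ep] = node
            free.remove(node)
        return mapping

    # Senders first: [6] found senders to bottleneck ~1/3 of Coflows.
    return assign(L_s, E_ing), assign(L_r, E_eg)
```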

4 Illustrative numerical examples

4.1 Simulation settings

Our simulated workload is generated from data traces collected at Facebook [16] on a 3000-machine cluster with 150 racks. To evaluate the performance of our proposed PACPS algorithm, we conduct large-scale numerical experiments using a trace-driven simulator that replays the Facebook trace. We obtain different traffic loads by adjusting the Coflow size with a scale factor, while preserving the Coflow structure, the input-to-output ratio of tasks, the locality constraints, and the inter-arrival times between jobs.

According to Chowdhury et al. [6], Coflows can be classified by length (short vs. long) and width (narrow vs. wide). We consider a Coflow to be short if its largest flow is less than 1 MB, and narrow if it involves fewer than 50 flows; vice versa for long and wide Coflows. Hence, based on length and width, we consider 4 bins of Coflows, i.e., short and narrow (SN), long and narrow (LN), short and wide (SW), and long and wide (LW) (shown in Table 1a). To study the impact of the PACPS algorithm on different Coflow mixes, we generate four mixes of Coflows (Mix-N, Mix-W, Mix-S, and Mix-L) by combining different types of Coflows from the 4 bins in Table 1a. The four mixes and their compositions are shown in Table 1b. For example, Mix-N contains 800 Coflows, of which 48% are SN, 40% are LN, 2% are SW, and 10% are LW Coflows. By simulating these four Coflow mixes, we can investigate the performance of PACPS in different scenarios.

Table 1 Coflow bins and mixes

We use the "gap from optimal (GFO)" as the performance metric. GFO is calculated as \(\left( {CCT^{real} - CCT^{opt}} \right)/CCT^{opt}\), where \(CCT^{real}\) is the actual completion time measured in our simulation and \(CCT^{opt}\) is the time to complete the Coflow transfer in an empty network (i.e., with no other flows competing for resources in the network). The gap from the optimal time indicates the margin for performance improvement of different Coflow placement and scheduling algorithms. In addition, to study the gain of Coflow placement, we compare our proposed MP placement scheme with the case of no Coflow placement optimization, i.e., assigning a Coflow to random senders and receivers, referred to as "Random Placement".
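For clarity, the metric reduces to the one-line computation below, a worked instance of the definition above.

```python
# GFO = (CCT_real - CCT_opt) / CCT_opt: relative slowdown versus an
# empty network. E.g., finishing in 30 ms when 20 ms is optimal -> 0.5.

def gap_from_optimal(cct_real, cct_opt):
    return (cct_real - cct_opt) / cct_opt

print(gap_from_optimal(30.0, 20.0))  # 0.5, i.e., 50% above optimal
```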

As some parallel computing applications are delay-sensitive, we also study how the PACPS algorithm helps Coflows meet their deadlines. For the deadline-constrained experiments, we set the deadline of a Coflow to its minimum CCT in an empty network, \(t^{empty}\), multiplied by \(\left( {1 + U\left( {0, x} \right)} \right)\), where \(U\left( {0, x} \right)\) is a uniform random number between 0 and \(x\). Unless otherwise specified, \(x = 1\) in this paper. We use 200 ms as the minimum deadline, as in [6].
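A sketch of this deadline rule follows; interpreting the 200 ms minimum as a floor on the generated deadline is our reading of the setup in [6].

```python
import random

# Deadline = t_empty * (1 + U(0, x)), with x = 1 by default; the 200 ms
# minimum from [6] is applied as a floor (our interpretation).

def coflow_deadline_ms(t_empty_ms, x=1.0):
    return max(200.0, t_empty_ms * (1.0 + random.uniform(0.0, x)))

print(coflow_deadline_ms(500.0))        # somewhere in [500, 1000] ms
print(coflow_deadline_ms(50.0, x=0.1))  # floored at 200.0 ms
```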

Our simulation is based on a PSON with 80 ToR switches, \(40 \times 40\) AWGRs, and \(1 \times 2\) space switches. Each ToR switch receives input traffic generated by the trace-driven simulator mentioned above. The transmission speed of a wavelength is 10 Gbps, and each ingress node has a buffer shared by all VoQs.

4.2 Numerical results analysis

In Fig. 3, we compare the GFO of the Min-Priority placement algorithm with those of the Min-Load and Min-Num placement algorithms. All three algorithms are combined with the PA scheduling algorithm. Min-Load placement considers only the size of the flows at the ingress nodes when computing the priority for placing a new Coflow, which is effectively the same as the 2D-placement algorithm in [8], i.e., placing the largest Coflow first. Min-Num placement considers only the number of flows at the ingress nodes, i.e., placing the narrowest Coflow first. Under light load (scale factor < 0.25), the three placement algorithms perform similarly because there is sufficient bandwidth for new Coflows in the network, so CCT is less sensitive to Coflow placement. Under medium or heavy workloads, the placement of current Coflows is critical to the performance of incoming Coflows. The Min-Priority placement algorithm outperforms the Min-Load and Min-Num placement algorithms across scale factors, as it effectively cooperates with the PA scheduling algorithm. This result demonstrates the benefit of jointly optimizing Coflow placement and scheduling.

Fig. 3 GFO of Min-Priority, Min-Load, and Min-Num placement algorithms

Figure 4 shows the GFO of the Min-Priority placement algorithm combined with three different scheduling algorithms, namely Priority-Aware (PA), Aalo, and Varys. PA scheduling always achieves the best GFO compared with Aalo and Varys, as the latter two are designed assuming fixed placement. This result further confirms that Coflow performance benefits from the cooperation between jointly optimized Coflow placement and scheduling algorithms.

Fig. 4 GFO of PA, Aalo, and Varys scheduling algorithms

In Fig. 5, we compare the GFO of Min-Priority placement with the Min-Num and Min-Load algorithms on the four different mixes, using PA as the default scheduling algorithm. Min-Priority still achieves the best performance for each Coflow mix. For Mix-S, since more than 50% of the Coflows are short and narrow, the number of flows plays a more important role than in the other Coflow mixes. As the Min-Num scheme considers the number of flows and is thus suited to the Mix-S scenario, it obtains a GFO similar to that of Min-Load. For Mix-L, since 58% of the Coflows are long, load is more important than the number of flows; therefore, the merit of Min-Load over Min-Num is more significant in terms of GFO gain.

Fig. 5 GFO of Min-Priority, Min-Load, and Min-Num algorithms for four extreme Coflow mixes

Figure 6 shows the percentage of Coflows that meet their deadlines when applying the Min-Priority placement algorithm, compared with Random Placement. For the default scale factor of 1, Min-Priority ensures that at least 61.3% of Coflows meet their deadlines under a strict deadline constraint (i.e., \(x = 0.1\)). As the deadline requirement becomes more flexible, Min-Priority placement helps more Coflows finish on time compared with Random Placement. This confirms the importance of Coflow placement in minimizing CCT.

Fig. 6 Percentage of Coflows that meet their deadline under Min-Priority placement and Random Placement

5 Conclusion

The increasing number of DPCAs is driving the need for scalable, low-latency, high-speed, and energy-efficient datacenter networks. In this study, we considered PSON as an interconnect solution for data-parallel clusters within datacenters. To support the concurrent flows between data-parallel clusters, known as Coflows, and to minimize CCT, we proposed PACPS, which consists of two cooperating algorithms: the Min-Priority placement algorithm and the PA scheduling algorithm. PA scheduling is designed based on the flow-level characteristics of Coflows and on the PSON architecture, and Min-Priority placement is designed to cooperate with PA to jointly minimize CCT. We numerically investigated the benefits of the joint design of Coflow placement and scheduling by comparing it with algorithms designed without such cooperation. Simulation results show that the joint design of placement and scheduling achieves a smaller gap from the optimal CCT. We also investigated how PACPS outperforms other algorithms for different mixes of Coflows and how PACPS helps Coflows meet their deadlines.