1 Introduction

Global datacenter (DC) traffic has been growing at a compound annual growth rate of 27% from 2016 to 2021, and 77% of this traffic resides within DCs [1]. Data-parallel-computing applications (DPCAs), such as big data, web search, and artificial intelligence, are driving this traffic growth [2, 3]. Most DPCAs are organized into multiple stages or have machines grouped by functionality. For example, MapReduce consists of a series of successive computation and communication stages. As the scale of MapReduce increases, more data-parallel clusters participate in each computation stage. Before the next computation stage can start, several parallel flows must finish transmitting between two groups of machines within a communication stage. The collection of flows in such a communication stage is called a "Coflow" [4], which bears special characteristics; e.g., the Coflow completion time (CCT) is determined by the completion time of the last-finished flow in the collection. Decreasing the average CCT leads to faster completion of the corresponding jobs of DPCAs [4]. However, traditional scheduling algorithms that optimize the completion of an individual flow cannot reduce the overall latency of a Coflow because of the dependency among all parallel flows in a Coflow. Therefore, Coflows require new scheduling schemes that address these special characteristics.

The Coflow scheduling and placement problems have recently attracted research attention, and several prior works exist in this direction. Many studies have investigated how to reduce CCT by optimizing Coflow scheduling, e.g., [5,6,7]. Baraat, the algorithm proposed in [5], uses local observations as indicators of CCT, resulting in poor performance for large-scale parallel clusters. Varys [6] requires information about all flows and Coflows, introducing high overhead for small Coflows. Aalo [7] applies an online approximate Shortest-Job-First algorithm, which leads to unfairness for large Coflows.

On the other hand, most recent works do not consider Coflow placement optimization; i.e., the locations of the parallel flows in a Coflow are either preset or chosen randomly at senders and receivers ("Random Placement"). But the average CCT is jointly determined by both Coflow placement and scheduling. So far, few works have proposed Coflow placement algorithms. The authors of [8] propose a 2D-placement algorithm for Coflows, which considers the size of Coflows and works independently under any scheduling algorithm. The placement algorithm "Neat" proposed in [9] uses the states of all flows of all Coflows and information about their traffic priority under a specific network scheduling policy, which requires considerable storage and computation resources as the DC scales. Because these Coflow placement approaches are unaware of the underlying switching architecture and of which scheduling algorithms are adopted in practice, optimizing Coflow placement alone may not yield a significant reduction in CCT.

Traditional DC architectures are based on electrical switching. This poses (and will soon pose even more dramatically) a set of challenges to modern DC operators, especially those running DPCAs, in terms of fast exhaustion of transmission bandwidth, high power consumption, high latency, and low scalability. For example, considering that today's DCs need to support bandwidths on the order of Tbps, the power consumption of a single switch can reach 2400 W [10]. Moreover, power consumption scales with the size of the DC network, leading to higher energy costs. In addition, DPCAs have diverse computing requirements: some are throughput-sensitive while others are delay-sensitive. For large-scale data-parallel computing, DPCAs expect a broad spectrum of services from the DC network. Thus, optical networking is a promising solution for modern DCs to support these frameworks over large-scale data-parallel clusters.

Wavelength-division multiplexing (WDM), long deployed in backbone optical networks, can provide multiple concurrent channels to avoid head-of-line blocking, achieving lower latency and lower packet loss than traditional electrical switches. However, WDM may not provide the large number of fine-grained channels needed to satisfy large-scale DC traffic requirements. Optical packet switching (OPS) has been proposed as a promising technique for intra-datacenter networks thanks to energy-efficient optical components. For example, Arrayed Waveguide Grating Routers (AWGRs) can work with fast-tuning lasers to significantly reduce the power consumption of packet-switching interconnects for DCs. Several OPS architectures have been proposed in [11,12,13,14], based on re-arrangeable, non-blocking, multistage models such as Benes or Banyan. However, when the interconnection pattern of the intra-datacenter network requires reconfiguration, the optical packet switch incurs a reconfiguration time during which service interruption can occur, possibly giving rise to packet loss. The time complexity of reconfiguration in these architectures is \(O\left( N\log_{2} N \right)\), where \(N\) is the number of input and output ports [12].

To reduce reconfiguration time, a simple and scalable OPS architecture, called Packet-Switched Optical Network (PSON), has recently been proposed in [15]. PSON is based on a one-stage core switching architecture, which reduces scheduling complexity. Furthermore, in PSON, the data plane and control plane are decoupled, so the control plane performs scheduling independently and keeps the data plane "always-on" (i.e., transmitting data without interruption). PSON also uses space switches to connect multiple AWGRs, making intra-datacenter interconnects easy to scale. In addition, the decoupling of data and control planes eases the implementation of advanced scheduling algorithms that can be customized for different applications. Hence, PSON is a suitable DC network architecture for DPCAs.

Going beyond the contributions of the aforementioned prior works [5,6,7,8,9], in this study we argue that the CCT minimization problem can be further studied from two other aspects. First, Coflow scheduling and placement algorithms should consider the specific structure of the underlying switching architecture interconnecting the data-parallel clusters, in our case the PSON. Second, the scheduling and placement algorithms should cooperate to jointly optimize the CCT of a Coflow. We therefore propose a PSON-aware Coflow placement and scheduling (PACPS) algorithm, which consists of two stages. First, we adopt a priority-aware scheduling algorithm based on flow-level characteristics of Coflows (such as the length of a Coflow, the number of flows in a Coflow, the total size of flows in a Coflow, etc.) and on the features of PSON (e.g., the size of a queue, the number of flows in a queue, the delay of the first flow in a queue, the space-switching time, etc.). Second, we design a Coflow placement algorithm that cooperates with the scheduling algorithm to reduce the average CCT.

This study is organized as follows. In Sect. 2, we introduce the Packet-Switched Optical Network datacenter architecture and Coflow characteristics. In Sect. 3, we describe the two-stage PSON-aware Coflow placement and scheduling algorithm. In Sect. 4, we quantitatively investigate the performance of the PACPS algorithm and compare it with state-of-the-art algorithms through simulation. Section 5 concludes this study.

2 Coflow characteristics and optical datacenter architecture

2.1 Coflow Characteristics

A Coflow is a collection of independent flows that share a common performance goal. Coflow performance is measured by the Coflow completion time (CCT), i.e., the time required to finish transmitting all flows of a Coflow [6].

To design efficient placement and scheduling algorithms for Coflows, we need to understand their characteristics. Since a Coflow consists of multiple parallel flows, it cannot be characterized by its size alone. Thus, the authors of [6] defined the "length" of a Coflow as the size of its largest flow in bytes, and its "width" as the number of parallel flows. In the length dimension, a Coflow is "short" if its length is ≤ 1 MB and "long" otherwise. In the width dimension, a Coflow with fewer than 50 flows is "narrow"; otherwise, it is "wide". Hence, based on length and width, Coflows can be classified into 4 bins: short and narrow, long and narrow, short and wide, and long and wide. The authors analyzed data traces from a Facebook warehouse cluster [16] with 3000 machines and 150 racks, observing that more than 40% of Coflows are short, meaning that only some Coflows have flows that are very large. Similarly, more than 60% of Coflows are narrow, while only a small number of Coflows consist of millions of flows. By studying the ratio of the number of senders to receivers for each Coflow, they also found senders to be bottlenecks for almost a third of Coflows, which confirms the importance of Coflow placement.
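To make the classification concrete, the following minimal sketch applies the length/width thresholds above. The function name is illustrative (not from [6]), and we take 1 MB = 10^6 bytes as an assumption.

```python
# Sketch of the Coflow length/width classification from [6].
# Names are illustrative; thresholds follow the text (1 MB, 50 flows).

def classify_coflow(flow_sizes_bytes):
    """Return one of the 4 bins based on a Coflow's parallel flow sizes."""
    length = max(flow_sizes_bytes)   # "length" = largest flow (bytes)
    width = len(flow_sizes_bytes)    # "width"  = number of parallel flows
    length_bin = "short" if length <= 1_000_000 else "long"  # 1 MB threshold
    width_bin = "narrow" if width < 50 else "wide"           # 50-flow threshold
    return f"{length_bin} and {width_bin}"

# Example: three 0.5-MB flows form a short and narrow Coflow.
print(classify_coflow([500_000, 500_000, 500_000]))  # "short and narrow"
```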

2.2 Packet-switched optical network

To design a more efficient Coflow placement and scheduling algorithm, knowledge of the underlying DC network architecture is also necessary. In this work, we consider the Packet-Switched Optical Network (PSON) as a promising architecture for intra-DC networks thanks to its excellent scalability, low latency, and energy efficiency [15].

In PSON, the data and control planes are decoupled from the actual optical devices and logically centralized. The data plane takes charge of packet switching, while the control plane manages the scheduling of photonic-frame switching in each time slot.

As shown in Fig. 1a, the core of the PSON architecture is composed of an array of AWGRs, which are passive, high-speed optical devices. An AWGR can achieve contention resolution in the wavelength domain and reduce energy consumption [15]. In PSON, a top-of-rack (ToR) switch connects to multiple servers, and \(n\) such ToR switches are interconnected by the AWGR-based core switching network. Each ToR switch contains an ingress module and an egress module. An ingress module consists of \(n\) virtual output queues (VoQs), each corresponding to another ToR switch as the destination. A \(1 \times m\) space switch connects an ingress module with \(m\) AWGRs, and each AWGR exclusively delivers photonic frames to \(n/m\) egress modules. The scalability of PSON is enhanced by using multiple \(1 \times m\) space switches, so the number of switch ports can be enlarged without considerably increasing latency. Figure 1b illustrates an example of applying \(1 \times 2\) space switches to increase the number of supported ToR switches from 40 to 80. Since ToR switches do not need to exchange signaling packets with each other, sophisticated scheduling decisions can be made in a timely manner to reduce switching latency, and precious switching bandwidth can be reserved for serving customers' data.

Fig. 1 a PSON architecture. b PSON data plane (with optical switch fabric). c Photonic frame format
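For illustration, the following schematic Python model captures the ingress-side structure described above: each ToR switch keeps one VoQ per destination ToR, and a \(1 \times m\) space switch selects among \(m\) AWGRs, each serving \(n/m\) egress modules. The class, its fields, and the contiguous AWGR-to-egress mapping are assumptions for exposition, not a specification of PSON.

```python
from collections import deque

# Schematic sketch of a PSON ingress module (assumed structure):
# n VoQs (one per destination ToR) plus a 1 x m space switch.

class IngressModule:
    def __init__(self, n_tors: int, m_awgrs: int):
        assert n_tors % m_awgrs == 0, "each AWGR serves n/m egress modules"
        self.n, self.m = n_tors, m_awgrs
        self.voqs = [deque() for _ in range(n_tors)]  # VoQ_ij per egress j
        self.switch_position = 0                      # current AWGR selection

    def enqueue(self, dst_tor: int, frame) -> None:
        self.voqs[dst_tor].append(frame)

    def awgr_for(self, dst_tor: int) -> int:
        # Assumed contiguous mapping: AWGR k serves egress modules
        # [k*n/m, (k+1)*n/m).
        return dst_tor // (self.n // self.m)

    def needs_tuning(self, dst_tor: int) -> bool:
        # True if serving this VoQ requires retuning the 1 x m space
        # switch; this is the information used later by the scheduler.
        return self.awgr_for(dst_tor) != self.switch_position
```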

3 Coflow-aware placement and scheduling algorithm design

3.1 Scheduling algorithm

Even with knowledge of the underlying DC architecture in hand, existing Coflow scheduling algorithms such as Baraat [5], Varys [6], and Aalo [7] are not suitable for PSON, because they neither account for the characteristics of the underlying switching architecture nor cooperate with a correlated Coflow placement algorithm. Likewise, the 2D-placement algorithm considers only the total size of each Coflow while ignoring other network performance measurements.

Fortunately, the Priority-Aware (PA) scheduling algorithm proposed for the PSON architecture in [15] overcomes the above weaknesses. The PA algorithm uses the priorities of VoQs to reflect and handle different flow-level characteristics of Coflows. The priority in PA is obtained by combining four strategies, namely longest queue first (LQF), largest number of flows first (LNFF), oldest flow first (OFF), and less space-switch tuning first (LSSTF), weighted as shown in Eq. (1).

$$W_{ij} = l_{ij} \cdot w_{l} + n_{ij} \cdot w_{n} + d_{ij} \cdot w_{d} + s_{ij} \cdot w_{s} \quad (1)$$

Here \(W_{ij}\) is the priority weight calculated for \(VoQ_{ij}\) (the virtual output queue at ingress node \(i\) associated with egress node \(j\)), \(l_{ij}\) is the length of \(VoQ_{ij}\), \(n_{ij}\) is the number of flows in \(VoQ_{ij}\), \(d_{ij}\) is the delay of the earliest flow in \(VoQ_{ij}\), and \(s_{ij}\) is a Boolean value (0 if sending a frame of \(VoQ_{ij}\) requires tuning the space switch; 1 otherwise). \(w_{l}\), \(w_{n}\), \(w_{d}\), and \(w_{s}\) are the corresponding weighting factors, normalized to the [0, 1] range based on their associated impact. Note that the entire scheduling process of the PA algorithm is executed in a separate centralized controller rather than at each port. With the VoQs' priority information, the centralized controller decides which nodes should be granted permission (if possible) to transmit traffic in the next time slot. In this way, the PA algorithm takes multiple network measurements into account to provide better network performance.
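As a concrete illustration, the sketch below computes Eq. (1) for a single VoQ. The equal default weighting factors are an assumption for demonstration only, and the inputs are assumed to be pre-normalized to [0, 1].

```python
# Sketch of the PA priority weight of Eq. (1). Default weighting factors
# (0.25 each) are illustrative assumptions; l_ij, n_ij, d_ij are assumed
# already normalized to [0, 1].

def pa_priority(l_ij, n_ij, d_ij, needs_tuning,
                w_l=0.25, w_n=0.25, w_d=0.25, w_s=0.25):
    s_ij = 0 if needs_tuning else 1  # LSSTF: prefer VoQs needing no tuning
    return l_ij * w_l + n_ij * w_n + d_ij * w_d + s_ij * w_s

# Per time slot, the centralized controller would grant transmission to
# the highest-weight VoQs whose ingress and egress ports are still free.
w = pa_priority(l_ij=0.8, n_ij=0.3, d_ij=0.5, needs_tuning=False)
print(w)  # 0.65
```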

3.2 Min-priority placement algorithm design

We design a Coflow placement algorithm, called Min-Priority (MP), that uses the same 'priority' concept as the PA scheduling algorithm, so that the placement algorithm can cooperate with the flow scheduling algorithm to jointly reduce CCT.

First, we present an example to provide some insight into the Coflow placement problem. To simplify the demonstration, in this example we temporarily use flow size as the only weight factor deciding the "priority" of each node. We consider a non-blocking, two-way, fully connected network containing three physical DC nodes, represented by a \(3 \times 3\) matrix (shown in the top right corner of Fig. 2a).

Fig. 2 Coflow placement motivation example

Since a new Coflow \(C_{n}\) consists of multiple parallel flows, we describe each parallel flow by a tuple (\(s_{i}\), \(r_{j}\), \(d_{i,j}\)), where \(s_{i}\) is logical sender \(i\), \(r_{j}\) is logical receiver \(j\), and \(d_{i,j}\) is the traffic demand from logical sender \(i\) to receiver \(j\). The new Coflow \(C_{n}\) in our example contains only three flows, involving two logical senders and two logical receivers in total, and a flow's logical sender cannot share a physical node with its receiver; the Coflow locality constraints are therefore represented by a \(2 \times 2\) red matrix \(D\) (shown in the bottom left corner of Fig. 2a), where the number in each block is \(d_{i,j}\). All parallel flows in one row of \(D\) must be transmitted by the same logical sender, and all parallel flows in one column of \(D\) must be sent to the same receiver. Figure 2a shows the current network state when the new Coflow \(C_{n}\) arrives; yellow blocks represent the size of existing flows on each physical node. In Fig. 2b–g, we list all possible placements of the new Coflow's \(s_{1}\), \(s_{2}\), \(r_{1}\), and \(r_{2}\), and give the CCT for each placement. Here, gradient-color blocks indicate nodes that will carry both existing Coflows and the new Coflow \(C_{n}\) after placement. The bottleneck logical sender of \(C_{n}\) is \(s_{1}\), which must send flows totaling 20 units, while the bottleneck receiver is \(r_{2}\), which must receive flows totaling 25 units. We observe that, to minimize CCT, we must avoid placing the new Coflow's bottleneck logical sender or receiver (endpoint) onto a physical node with insufficient outbound (or inbound) bandwidth, because such bottleneck endpoints are more likely to delay the CCT than non-bottleneck endpoints. For example, in Fig. 2, we should place \(C_{n}\)'s bottleneck logical receiver \(r_{2}\) on the physical egress node \(eg_{1}\), which has sufficient bandwidth, to obtain the minimum CCT (Fig. 2g). Because existing Coflows already claim some bandwidth on \(eg_{2}\) and \(eg_{3}\), placing the bottleneck receiver \(r_{2}\) on either of these busier nodes (Fig. 2b–e) would delay \(C_{n}\)'s bottleneck and extend the CCT.

Algorithm 1 Min-Priority (MP) Coflow placement algorithm (pseudocode)

In contrast, placing the non-bottleneck receiver \(r_{1}\) on a busier node \(eg_{2}\) is less harmful, as shown in Fig. 2g, because the non-bottleneck receiver \(r_{1}\) can tolerate a certain amount of delay as long as its delay does not exceed that of the bottleneck receiver \(r_{2}\) and does not increase the CCT.

We also note that a traditional flow-level placement algorithm may be suboptimal for Coflows. For example, the authors of [9] place a Coflow's constituent flows sequentially in decreasing order of flow size, because "large flows are more likely to be critical to determine CCT" [9]. However, such a strategy can yield suboptimal solutions, as shown in Fig. 2d, f. Under this flow-level strategy, \(C_{n}\)'s largest flow, \(f_{2,2} = 15\), is deemed most critical and may occupy the empty ingress node \(ing_{3}\), as shown in Fig. 2d, f. This placement leaves the suboptimal ports of \(ing_{1}\) or \(ing_{2}\) for the bottleneck logical sender \(s_{1}\) of \(C_{n}\), leading to a higher CCT compared with Fig. 2g.
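The bottleneck endpoints in this example can be read directly off the demand matrix \(D\) as row and column sums, as in the following sketch. The individual demand values are assumed so that the totals match the text (\(s_{1}\) sends 20 in total, \(r_{2}\) receives 25 in total, and \(f_{2,2} = 15\)).

```python
# Sketch: identify a Coflow's bottleneck sender/receiver from its demand
# matrix D (rows = logical senders, columns = logical receivers). The
# individual entries are assumed values consistent with the example.

D = [[10, 10],   # s1 -> r1, s1 -> r2 (s1 sends 20 in total)
     [ 0, 15]]   # s2 -> r2 is the largest flow f_{2,2} = 15

send_totals = [sum(row) for row in D]          # [20, 15]
recv_totals = [sum(col) for col in zip(*D)]    # [10, 25]

bottleneck_sender = send_totals.index(max(send_totals))    # 0, i.e., s1
bottleneck_receiver = recv_totals.index(max(recv_totals))  # 1, i.e., r2
print(bottleneck_sender, bottleneck_receiver)              # 0 1
```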

On the other hand, since CCT is jointly determined by both Coflow placement and scheduling, taking the scheduling algorithm into account when designing Coflow placement allows their cooperation to reduce CCT. As the PA scheduling algorithm in Sect. 3.1 combines four weight factors to represent the priority of a node, we design a Min-Priority (MP) Coflow placement algorithm that not only uses the priority weight from PA but also, following the motivation example in Fig. 2, places the bottleneck logical endpoint(s) of a Coflow at the DC node(s) with minimum priority weight. Reference [6] found logical senders to be bottlenecks for almost a third of Coflows by studying the ratio of the number of senders to receivers for each Coflow. We therefore first assign physical nodes to logical senders and then to logical receivers.

To enable cooperation between the proposed MP placement algorithm and the PA scheduling algorithm, we use the priority of each node to characterize Coflow contention, as explained below. In Algorithm 1, \(e_{i}^{ing}\) represents the existing priority of ingress node \(i\), defined as the sum of the priorities of all VoQs at ingress node \(i\) (\(e_{i}^{ing} = \sum_{j = 1}^{N} W_{ij}\)), where \(W_{ij}\) comes from Eq. (1). Similarly, \(e_{j}^{eg}\) represents the priority of egress node \(j\), defined as the sum of the priorities of the VoQs sending flows to egress node \(j\) from the different ingress nodes (\(e_{j}^{eg} = \sum_{i = 1}^{N} W_{ij}\)). We use vectors \(E^{ing}\left[ \cdot \right]\) and \(E^{eg}\left[ \cdot \right]\) to denote the sets of \(e_{i}^{ing}\) and \(e_{j}^{eg}\), respectively, and vectors \(L^{s}\left[ {s_{1}, \ldots, s_{N}} \right]\) and \(L^{r}\left[ {r_{1}, \ldots, r_{M}} \right]\) to denote the weights to be added to ingress (egress) nodes by the new Coflow demand requested on each logical sender (receiver) of \(C_{n}\) (Lines 2 and 3 in Algorithm 1). The asymptotic complexity of the MP Coflow placement algorithm is \(O\left( M \right)\), where \(M\) is the number of flows in a Coflow.

Our proposed MP placement algorithm consists of two steps (see Algorithm 1; a sketch follows below). When a new Coflow arrives, MP first iteratively updates the to-be-added priorities \(L^{s}\left[ {s_{i}} \right]\) and \(L^{r}\left[ {r_{j}} \right]\) of each logical sender and receiver (Lines 2 and 3) by adding \(d_{i,j} \cdot w_{l} + 1 \cdot w_{n}\) per flow, where \(w_{l}\) and \(w_{n}\) are the size and number-of-flows weighting factors from Eq. (1). Second, MP considers each logical sender (or receiver) in descending order of its to-be-added priority and places it onto the ingress (or egress) node that is least likely to delay the new Coflow's bottleneck.
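The following Python sketch illustrates these two steps under stated assumptions: the priorities \(E^{ing}\)/\(E^{eg}\) are precomputed from Eq. (1), each logical endpoint occupies a distinct physical node (as in Fig. 2), and ties are broken arbitrarily. It illustrates Algorithm 1's logic; it is not its verbatim pseudocode.

```python
# Sketch of the two-step MP placement. Assumptions: E_ing/E_eg hold the
# current per-node priority sums from Eq. (1); each logical endpoint is
# mapped to a distinct physical node; ties break arbitrarily.

def mp_place(D, E_ing, E_eg, w_l, w_n):
    """D[i][j]: demand from logical sender i to logical receiver j.
    Returns (sender -> ingress node, receiver -> egress node) maps."""
    n_s, n_r = len(D), len(D[0])
    # Step 1: to-be-added priority per endpoint (d_ij*w_l + 1*w_n per flow).
    L_s = [sum(D[i][j] * w_l + w_n for j in range(n_r) if D[i][j] > 0)
           for i in range(n_s)]
    L_r = [sum(D[i][j] * w_l + w_n for i in range(n_s) if D[i][j] > 0)
           for j in range(n_r)]

    def assign(L, E):
        # Step 2: heaviest endpoint first onto the minimum-priority node.
        free, mapping = set(range(len(E))), {}
        for ep in sorted(range(len(L)), key=lambda k: -L[k]):
            node = min(free, key=lambda n: E[n])
            mapping[ep] = node
            free.remove(node)
        return mapping

    # Senders first: [6] found senders to bottleneck ~1/3 of Coflows.
    return assign(L_s, E_ing), assign(L_r, E_eg)
```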

4 Illustrative numerical examples

4.1 Simulation settings

Our simulated workload is generated from data traces collected at Facebook [16] on a 3000-machine cluster with 150 racks. To evaluate the performance of our proposed PACPS algorithm, we conduct large-scale numerical experiments using a trace-driven simulator that replays the Facebook trace. We obtain different traffic loads by adjusting the Coflow size with a scale factor, while preserving the Coflow structure, the input-to-output ratio of tasks, the locality constraints, and the inter-arrival times between jobs.

According to Chowdhury et al. [6], Coflows can be classified by length (short vs. long) and width (narrow vs. wide). We consider a Coflow to be short if its largest flow is less than 1 MB, and narrow if it involves fewer than 50 flows; vice versa for long and wide Coflows. Hence, based on length and width, we consider 4 bins of Coflows, i.e., short and narrow (SN), long and narrow (LN), short and wide (SW), and long and wide (LW) (shown in Table 1a). To study the impact of the PACPS algorithm on different Coflow mixes, we generate four mixes of Coflows (Mix-N, Mix-W, Mix-S, and Mix-L) by combining different types of Coflows from the 4 bins in Table 1a. The four mixes and their compositions are shown in Table 1b. For example, Mix-N contains 800 Coflows, of which 48% are SN, 40% are LN, 2% are SW, and 10% are LW Coflows. By simulating these four Coflow mixes, we can investigate the performance of PACPS in different scenarios.

Table 1 Coflow bins and mixes

We use the "gap from optimal (GFO)" as the performance metric. GFO is calculated as \(\left( {CCT^{real} - CCT^{opt}} \right)/CCT^{opt}\), where \(CCT^{real}\) is the actual completion time measured in our simulation and \(CCT^{opt}\) is the time to complete the Coflow transfer in an empty network (i.e., with no other flows competing for resources in the network). The gap from the optimal time indicates the margin for performance improvement of different Coflow placement and scheduling algorithms. In addition, to study the gain of Coflow placement, we compare our proposed MP placement scheme with the case of no Coflow placement optimization, i.e., assigning a Coflow to random senders and receivers, referred to as "Random Placement".
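For clarity, the metric reduces to the one-line computation below, a worked instance of the definition above.

```python
# GFO = (CCT_real - CCT_opt) / CCT_opt: relative slowdown versus an
# empty network. E.g., finishing in 30 ms when 20 ms is optimal -> 0.5.

def gap_from_optimal(cct_real, cct_opt):
    return (cct_real - cct_opt) / cct_opt

print(gap_from_optimal(30.0, 20.0))  # 0.5, i.e., 50% above optimal
```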

As some parallel computing applications are delay-sensitive, we also study how the PACPS algorithm helps Coflows meet their deadlines. For the deadline-constrained experiments, we set the deadline of a Coflow to its minimum CCT in an empty network, \(t^{empty}\), multiplied by \(\left( {1 + U\left( {0, x} \right)} \right)\), where \(U\left( {0, x} \right)\) is a uniform random number between 0 and \(x\). Unless otherwise specified, \(x = 1\) in this paper. We use 200 ms as the minimum deadline, as in [6].
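A sketch of this deadline rule follows; interpreting the 200 ms minimum as a floor on the generated deadline is our reading of the setup in [6].

```python
import random

# Deadline = t_empty * (1 + U(0, x)), with x = 1 by default; the 200 ms
# minimum from [6] is applied as a floor (our interpretation).

def coflow_deadline_ms(t_empty_ms, x=1.0):
    return max(200.0, t_empty_ms * (1.0 + random.uniform(0.0, x)))

print(coflow_deadline_ms(500.0))        # somewhere in [500, 1000] ms
print(coflow_deadline_ms(50.0, x=0.1))  # floored at 200.0 ms
```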

Our simulation is based on a PSON with 80 ToR switches, \(40 \times 40\) AWGRs, and \(1 \times 2\) space switches. Each ToR switch receives input traffic generated by the trace-driven simulator mentioned above. The transmission speed of a wavelength is 10 Gbps, and each ingress node has a buffer shared by all VoQs.

4.2 Numerical results analysis

In Fig. 3, we compare the GFO of the Min-Priority placement algorithm with those of the Min-Load and Min-Num placement algorithms. All three algorithms are combined with the PA scheduling algorithm. Min-Load placement considers only the size of the flows at the ingress nodes when computing the priority for placing a new Coflow, which is effectively the same as the 2D-placement algorithm in [8], i.e., placing the largest Coflow first. Min-Num placement considers only the number of flows at the ingress nodes, i.e., placing the narrowest Coflow first. Under light load (scale factor < 0.25), the three placement algorithms perform similarly because there is sufficient bandwidth for new Coflows in the network, so CCT is less sensitive to Coflow placement. Under medium or heavy workloads, the placement of current Coflows is critical to the performance of incoming Coflows. The Min-Priority placement algorithm outperforms the Min-Load and Min-Num placement algorithms across scale factors, as it effectively cooperates with the PA scheduling algorithm. This result demonstrates the benefit of jointly optimizing Coflow placement and scheduling.

Fig. 3 GFO of Min-Priority, Min-Load, and Min-Num placement algorithms

Figure 4 shows the GFO of the Min-Priority placement algorithm combined with three different scheduling algorithms, namely Priority-Aware (PA), Aalo, and Varys. PA scheduling always achieves the best GFO compared with Aalo and Varys, as the latter two are designed assuming fixed placement. This result further confirms that Coflow performance benefits from the cooperation between jointly optimized Coflow placement and scheduling algorithms.

Fig. 4 GFO of PA, Aalo, and Varys scheduling algorithms

In Fig. 5, we compare the GFO of Min-Priority placement with the Min-Num and Min-Load algorithms on the four different mixes, using PA as the default scheduling algorithm. Min-Priority still achieves the best performance for each Coflow mix. For Mix-S, since more than 50% of the Coflows are short and narrow, the number of flows plays a more important role than in the other Coflow mixes. As the Min-Num scheme considers the number of flows and is thus suited to the Mix-S scenario, it obtains a GFO similar to that of Min-Load. For Mix-L, since 58% of the Coflows are long, load is more important than the number of flows; therefore, the merit of Min-Load over Min-Num is more significant in terms of GFO gain.

Fig. 5 GFO of Min-Priority, Min-Load, and Min-Num algorithms for four extreme Coflow mixes

Figure 6 shows the percentage of Coflows that meet their deadlines when applying the Min-Priority placement algorithm, compared with Random Placement. For the default scale factor of 1, Min-Priority ensures that at least 61.3% of Coflows meet their deadlines under a strict deadline constraint (i.e., \(x = 0.1\)). As the deadline requirement becomes more flexible, Min-Priority placement helps more Coflows finish on time compared with Random Placement. This confirms the importance of Coflow placement in minimizing CCT.

Fig. 6 Percentage of Coflows that meet their deadline under Min-Priority placement and Random Placement

5 Conclusion

The increasing number of DPCAs is driving the need for scalable, low-latency, high-speed, and energy-efficient datacenter networks. In this study, we considered PSON as an interconnect solution for data-parallel clusters within datacenters. To support the concurrent flows between data-parallel clusters, known as Coflows, and to minimize CCT, we proposed PACPS, which consists of two cooperating algorithms: the Min-Priority placement algorithm and the PA scheduling algorithm. PA scheduling is designed based on the flow-level characteristics of Coflows and on the PSON architecture, and Min-Priority placement is designed to cooperate with PA to jointly minimize CCT. We numerically investigated the benefits of the joint design of Coflow placement and scheduling by comparing it with algorithms designed without such cooperation. Simulation results show that the joint design of placement and scheduling achieves a smaller gap from the optimal CCT. We also investigated how PACPS outperforms other algorithms for different mixes of Coflows and how PACPS helps Coflows meet their deadlines.