Keywords

1 Introduction

Satellite communication can provide seamless wireless signal coverage to support and expand ground communication, which has become an important research direction of 5G and future 6G [1,2,3]. With the large-scale deployment of OneWeb, StarLink, TeleSat and other mega constellations, LEO satellite shows its advantages in reducing communication delay, providing wide area coverage, and not affected by the ground environment. However, due to the high mobility of LEO satellite, terminals need to switch frequently to maintain the connection of communication links. This will increase the signaling overhead and drop call rate of the system, and seriously affect the system throughput. Therefore, more and more scholars have studied the handoff problem of LEO satellite, hoping to reduce the impact of high-speed mobility.

According to whether the direction angle of satellite beam in LEO system is adjustable, LEO system is usually divided into two types: one is satellite fixed cell system (SFCS), the other is earth fixed cell system (EFCS) [4]. EFCS is also called staring beam satellite system. In [5], a new satellite handoff strategy based on the potential game of mobile terminal in LEO satellite communication network is proposed. In the software defined satellite network (SDSN) architecture, the author regards satellite handoff as a bipartite graph and proposes a terminal random-access algorithm based on the target of user space maximization. In [6], a fixed beam LEO satellite model is introduced, and the author proposes a method to analyze the throughput of fixed beam according to its coverage time. In [7], the authors propose a performance comparison of fixed and dynamic channel allocation techniques in a LEO satellite system, and they study the case of earth-fixed cell systems with different kinds of fixed and mobile users. In [8], the author presented a comprehensive literature review on applications of deep reinforcement learning (DRL) in communications and networking. In [9], a dynamic channel reservation (DCR) strategy based on deep Q network is proposed for multi-service LEO satellite communication system, which can improve the overall quality of service (QoS) of the system. Inspired by this, we will try to solve the staring beam scheduling problem by reinforcement learning.

The rest of the paper is organized as follows. In Sect. 2, we introduce the LEO satellite beam scheduling system model and an optimal problem is proposed under the constraints of the number of satellite beams and satellite capacity. In Sect. 3, we solve the problem by using a multi-agent DQN learning algorithm. Simulation results are analyzed in Sect. 4 and conclusions are drawn in Sect. 5.

2 System Model

We consider the problem of LEO satellite beam scheduling during satellite operation, as shown in Fig. 1. The set of satellites is denoted by \(\mathcal M=\{1,2,...,M\}\). In this paper, the satellite is equipped with a multi beam phased array antenna system, so it can gaze at one or more hot spots by adjusting the parameters of the antenna array. Each satellite can form up to K beams, which is denoted by \(\mathcal K=\{1,2,...,K\}\). The earth’s surface is divided into a fixed number of hot spots according to the degree of user service demand, and the set of communication hot spots is denoted by \(\mathcal N=\{1,2,...,N\}\). Figure 1 shows the coverage of two orbiting satellites to the hot spot area on the ground. The yellow dotted line is the satellite orbit. From t1 to t2, area A is within the coverage of Leo1, so Leo1 can adjust the antenna to maintain the connection to area A. At t3, leo1 exceeds the visible range of area A and starts to serve area F. At t2, because both leo1 and leo2 are within the visible range of area A, that is, a region may have multiple satellites covering at the same time. Also when the service satellite of a region leaves, it is necessary for this region to access another satellite to ensure that the communication will not be interrupted. Therefore, staring beam satellite system mainly involves global satellite beam scheduling problem.

Fig. 1.
figure 1

LEO satellite system model of staring beam.

Fig. 2.
figure 2

Two dimensional model of staring beam satellite system.

In order to simplify the model, this paper uses a two-dimensional plane model to model the staring beam scheduling problem, as shown in Fig. 2. A series of hot spots (red spots) are evenly distributed in the plane area. The satellite (blue dot) moves along the fixed orbit (yellow dotted line) at the same speed. When the satellite moves beyond the plane area, it will appear from the other side of the map and continue to move along the track. Each satellite can provide staring services for one or more regions at the same time, which is constrained by the maximum number of beams and the elevation angle between the satellite and the staring region. And each region can also accept the service of multiple satellites.

When the satellite serves a hot spot area, this paper assumes that the satellite can completely cover this area to avoid the discussion of incomplete beam coverage. When dealing with staring beam scheduling, there are three main factors considered in this paper:

  1. 1)

    Satellites should provide services to the nearest hot spots as far as possible to improve the service quality;

  2. 2)

    When the satellite moves out of a hot spot area, other satellites should continue to cover the area to reduce the drop rate of the area;

  3. 3)

    The satellite load capacity and the capacity of hot spot area should be considered in beam scheduling. If the satellite capacity is not enough to fully serve the current region, other nearby satellites should participate in the service to ensure the service quality.

Assume that the capacity of the satellite is \(U=\{u_1,u_2,\ldots , u_N\}\), the remaining beam of each satellite is \(B=\{b_1,b_2,\ldots ,b_M\}\), and the capacity requirement of hot spot area is \(L=\{l_1,l_2,\ldots , l_N\}\). When satellite i is connected with hot spot j, satellite i will provide services for hot spot j as much as possible. At this time, the remaining capacity of satellite i is:

$$\begin{aligned} s_{i}^{\prime }=s_{i}-\min \left( u_{i}, l_{j}\right) \end{aligned}$$
(1)

the remaining capacity requirement of hot spot j is:

$$\begin{aligned} u_{j}^{\prime }=u_{j}-\min \left( u_{i}, l_{j}\right) \end{aligned}$$
(2)

and the number of remaining beams of satellite i is:

$$\begin{aligned} b_{i}^{\prime }=b_{i}-1 \end{aligned}$$
(3)

Considering the two-dimensional plane model we built, the elevation relationship between the hot spot area and the satellite will be transformed into the Euclidean distance between them. If the satellite i coordinate is \(\left( x_{i, 1}, x_{i, 2}\right) \) and the hot spot area j coordinate is \(\left( y_{j, 1}, y_{j, 2}\right) \), then the Euclidean distance between the satellite and the hot spot area can generate the weight matrix of M rows and N columns:

$$\begin{aligned} \boldsymbol{W}= \begin{bmatrix} w_{11} &{} w_{12} &{} \cdots &{} w_{1N} \\ w_{21} &{} w_{22} &{} \cdots &{} w_{2N} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ w_{M1} &{} w_{M2} &{} \cdots \ &{} w_{MN} \\ \end{bmatrix} \end{aligned}$$
(4)

where \(w_{i j}=\sqrt{\sum _{k=1}^{2}\left( x_{i k}-y_{j k}\right) ^{2}}, x_{i} \in X, y_{j} \in Y\). Then we introduce the beam allocation matrix F of satellite and hot spot area as follows:

$$\begin{aligned} \boldsymbol{F}= \begin{bmatrix} f_{11} &{} f_{12} &{} \cdots &{} f_{1N} \\ f_{21} &{} f_{22} &{} \cdots &{} f_{2N} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ f_{M1} &{} f_{M2} &{} \cdots \ &{} f_{MN} \\ \end{bmatrix} \end{aligned}$$
(5)

where \(f_{m, n} \in \{0,1\}\), \(f_{m,n}=1\) means that satellite m allocates a beam to hot spot area n. The service radius of the satellite is R. And the matrix F can be obtained by:

$$\begin{aligned} f_{m,n}=\left\{ \begin{array}{rcl} 0, \quad &{} &{} {w_{m,n}>R, or\; u_m=0, or\; l_j=0}\\ 1, \quad &{} &{} {others}\\ \end{array} \right. \end{aligned}$$
(6)

We introduce \(c_{i,j}\) to represent the capacity of satellite i allocated to hot spot area j, then the capacity allocation matrix C can be expressed as:

$$\begin{aligned} \boldsymbol{C}= \begin{bmatrix} c_{11} &{} c_{12} &{} \cdots &{} c_{1N} \\ c_{21} &{} c_{22} &{} \cdots &{} c_{2N} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ c_{M1} &{} c_{M2} &{} \cdots \ &{} c_{MN} \\ \end{bmatrix} \end{aligned}$$
(7)

In this paper, incomplete service rate, handover rate and insufficient capacity rate are used to measure the performance of beam allocation. Incomplete service rate \(P_b\) refers to the proportion of hot spot area whose business requirements can’t be met. It can be given by:

$$\begin{aligned} P_{b}=\frac{\sum \limits _{j \in \mathcal N} g_{j}}{N} \end{aligned}$$
(8)

where \(g_{j}\) is:

$$\begin{aligned} g_{j}=\left\{ \begin{array}{ll} 0, \quad &{} \sum \limits _{i \in \mathcal M} C_{i j}<l_{j} \\ 1, \quad &{} \text{ others } \end{array}\right. \end{aligned}$$
(9)

Handover rate \(P_{h}\) is used to measure the frequency of satellite switching hot spots in the service process, which is given by:

$$\begin{aligned} P_{h}=\frac{\sum \limits _{n \in \mathcal N}\sum \limits _{m \in \mathcal M} h_{m,n}^t}{\sum \limits _{n \in \mathcal N}\sum \limits _{m \in \mathcal M} f_{m,n}^t} \end{aligned}$$
(10)

where \(h_{m,n}^t\) is:

$$\begin{aligned} h_{m,n}^t=\left\{ \begin{array}{ll} 0, \quad &{} f_{m,n}^t = f_{m,n}^{t+1} \\ 1, \quad &{} \text{ others } \end{array}\right. \end{aligned}$$
(11)

\(P_c\) is the insufficient capacity rate of hot spot area demand, which is a supplement to \(P_b\). It can be expressed as:

$$\begin{aligned} P_{c}=\frac{\sum \limits _{i \in \mathcal M}\sum \limits _{j \in \mathcal N} c_{i,j}}{\sum \limits _{j \in \mathcal N} l_{j}} \end{aligned}$$
(12)

Then the staring beam scheduling problem is formulated as follows:

$$\begin{aligned}&\min _{c_{m,n}} \quad P=\alpha _1P_b + \alpha _2P_h + \alpha _3P_c&\\ \end{aligned}$$
(13)
$$\begin{aligned} \text{ s.t. }\quad&\sum \limits _{n \in \mathcal N}f_{m,n} \le K, \quad \forall m \in \mathcal M \end{aligned}$$
(13a)
$$\begin{aligned}&\sum \limits _{n \in \mathcal N}c_{m,n} \le u_m, \quad \forall m \in \mathcal M \end{aligned}$$
(13b)
$$\begin{aligned}&f_{m,n} \in \{0,1\}, \quad \forall m \in \mathcal M, \; \forall n \in \mathcal N \end{aligned}$$
(13c)
$$\begin{aligned}&c_{m,n} \in [0,u_m], \quad \forall m \in \mathcal M, \; \forall n \in \mathcal N \end{aligned}$$
(13d)

where \(\alpha _1\), \(\alpha _2\), and \(\alpha _3\) are positive parameters.

3 Multi-agent Deep Q-Learning Algorithm

Deep Q-Learning (DQN) algorithm is a classic and effective reinforcement learning algorithm, which can solve complex problems in many communication scenarios. We extend the deep Q-learning to the multi-agent cases to solve the problem of staring beam scheduling. In the multi-agent DQN model, the learning and decision of each agent are realized by DQN algorithm. \(s_{e, t}^{m, k}, a_{e, t}^{m, k}, r_{e, t}^{m, k}\), and \(s_{e, t+1}^{m, k}\) represent the state, action, reward and the next state of agent (satellite) i at time t of the e-th training round. The online Q function fitted by neural network and the objective Q function are randomly initialized. With the continuous interaction between the agent and the environment, the generated action sequence is stored in the experience pool. In order to minimize the error function \(L\left( \theta ^{m}\right) \), a batch of sequences are randomly selected from the experience playback pool every certain interval. \(L\left( \theta ^{m}\right) \) is given by

$$\begin{aligned} L\left( \theta ^{m}\right) =\left( r_{j}^{m}+\gamma \max _{A^{i}} Q^{m}\left( s_{j+1}^{m}, a^{m} ; \theta ^{m-}\right) -Q^{m}\left( s_{j}^{m}, a_{j}^{m} ; \theta ^{m}\right) \right) ^{2} \end{aligned}$$
(14)

where \(\gamma \) is the discount factor, \(\theta ^{m-}\) is the parameters of target value network, and \(\theta ^{m}\) is the parameters of online value network.

figure a

The multi-agent deep Q-learning algorithm is shown in Algorithm 1. State \(s_{e, t}^{m}=\left[ W_{e, t}^{i}, a_{e, t-1}^{m}, a_{e, t-2}^{m}, C_{e, t}^{m}\right] \) represents the allocation state of the k-th beam of satellite m of the e-th round during training at the time t. The reward for satellite m performing action \(a_{e, t}^{m, k}\) under the k-th beam at the time t in the e-th round is \(r_{e, t}^{m, k}\). and it can be expressed as

$$\begin{aligned} r_{e, t}^{m, k}=-r\_l_{e, t}^{m, k}-a \times r\_b_{e, t}^{m, k}+b \times r\_c_{e, t}^{m, k} \end{aligned}$$
(15)

where a is the penalty coefficient of disconnection, and b is the reward coefficient of connection.

Here \(r\_l_{e, t}^{m, k}\) is the distance penalty, which aims to make the satellite serve the nearest area as far as possible. \(r\_b_{e, t}^{m, k}\) is the disconnection penalty and \(r\_c_{e, t}^{m, k}\) is the connection reward. In this way, the handover rate can be reduced. And the three parameters is given by

$$\begin{aligned} r\_l_{e, t}^{m, k}=\sqrt{\sum _{k=1}^{2}\left( x_{i k}-y_{j k}\right) ^{2}}, x_{i} \in X, y_{j} \in Y \end{aligned}$$
(16)
$$\begin{aligned} r\_b_{e, t}^{m, k}=\left\{ \begin{array}{cc} -La &{} \quad c_{m,n}^{t-1}=1 \; {and}\; c_{m,n}^{t}\ne 1 \\ 0 &{} \quad \text{ else } \end{array}\right. \end{aligned}$$
(17)
$$\begin{aligned} r\_c_{e, t}^{m, k}=\left\{ \begin{array}{cc} \sqrt{\sum \limits _{j=1}^{2}\left( x_{m, j}-y_{n,k}\right) ^{2}}, x_{m} \in X, y_{a_{n}} \in Y &{} \quad c_{m,n}^{t-1}=c_{m,n}^{t}=1 \\ 0 &{} \quad \text{ else } \end{array}\right. \end{aligned}$$
(18)

By synthesizing the three kinds of rewards and adjusting their coefficients, the agent is expected to learn the corresponding strategies to meet the performance requirements.

4 Simulation Results

In this section, we present the simulation results of proposed strategy what we called multi-agent DQN and compare it with Beam scheduling algorithm based on KM. In this paper, we consider that a reasonable beam scheduling scheme should minimize the sum of the connection distances between the satellite and the hot spot region, that is, to find the minimum value of the edge weight of the bipartite graph \(G = \left( V, E\right) \). So the problem is transformed into the optimal matching of bipartite graph, which can be solved by the classical KM algorithm.

The simulation parameters are summarized in Table 1. We consider a square area with 500 km side length, where hot spot areas are randomly distributed within the area. When the number and capacity of satellites are fixed, we focus on the impact of the maximum number of beams and the number of hot spots on the algorithm performance. In terms of satellite capacity, refer to StarLink single satellite capacity (17 Gbps) and Iridium system single satellite capacity (7.5 Gbps), the satellite capacity is set to 3 Gbps and the maximum capacity demand of hot spot area is 1 Gbps.

Table 1. Simulation parameters
Fig. 3.
figure 3

The performance of the system varies with the penalty coefficient a of disconnection.

Fig. 4.
figure 4

The performance of the system varies with the reward coefficient b of connection.

In Fig. 3 and Fig. 4, we compared the effect of the penalty coefficient a of disconnection and the reward coefficient b of connection on the beam scheduling performance. It can be found that when \(a=1.0\) and \(b = 0.75\), the handover rate of the system reaches the lowest, and the rate of incomplete service and insufficient capacity rate are also at a low value. So we set \(a=1.0\) and \(b = 0.75\) in the following simulation.

Figure 5 shows the influence of the number of beams of satellite on the handover rate, incomplete service rate and insufficient capacity rate. With the increase of the maximum number of beams, the performance of KM algorithm and multi-agent DQN algorithm is improved. When the number of beams is small, the performance of KM algorithm is better than that of multi-agent DQN algorithm. However, with the increase of the number of beams, their coverage performance is almost the same, but the multi-agent DQN is significantly better than KM algorithm in the handover rate. This is because the punishment of satellite switching is strengthened in the training, so the satellite connection strategy is adjusted.

Fig. 5.
figure 5

The performance of the system varies with the maximum number of beams.

Fig. 6.
figure 6

The performance of the system varies with the number of hot spot area.

Figure 6 shows the influence of the number of hot spot area on the handover rate, incomplete service rate and insufficient capacity rate. As the number of hot spots increases, the performance of both algorithms decreases. The handover rate of multi-agent DQN algorithm is always better than that of KM algorithm. But its coverage performance is not as good as KM. In general, by adjusting the reward setting, the reinforcement learning algorithm achieves better performance in the handover rate. In the future, when the number of satellites is increasing, the coverage performance will not be the main consideration, while reducing the handover rate can greatly reduce the signaling overhead of the system, which indicates that multi-agent DQN algorithm is a desirable staring beam scheduling algorithm.

5 Conclusion

In this paper, we have investigated the staring beam scheduling problem in LEO satellite network. By establishing a two-dimensional model, an optimal problem is proposed under the constraints of the number of satellite beams and satellite capacity. We have solved the problem by multi-agent DQN algorithm. Compared with the beam scheduling algorithm based on KM, multi-agent DQN algorithm can adjust the weight to change the scheduling strategy to meet the optimization requirements. Simulation results have shown that the proposed algorithm can reduce the handover rate, which is of great significance to reduce the network signaling overhead.