
1 Introduction

The Satellite-Terrestrial Integrated Network (STIN) promises to provide land, sea, air, and space users with anytime, global-coverage, on-demand, and safe and reliable information services [1]. Low Earth Orbit (LEO) satellite constellations within STIN, such as the OneWeb, SpaceX, and Telesat systems, have been providing broadband Internet access for areas with underdeveloped telecommunication infrastructure [2].

However, STIN features a heterogeneous structure with widely distributed and highly dynamic nodes, limited onboard wireless resources, non-negligible delay, and severe fading. Moreover, because LEO satellites move at high speed and low altitude, each satellite has a relatively short visibility period, causing frequent handovers between ground terminals and satellites. To maintain communication with its counterpart, a user has to switch among the covering satellites. To date, SpaceX has launched roughly 3,108 satellites for Starlink [3]. In this setting, it is increasingly common for multiple satellites to cover the same area simultaneously. Such a complex and dynamic environment greatly challenges wireless resource management.

In current STIN access selection schemes, single-objective access algorithms and multi-objective weighting algorithms [4] are unsuitable for access decisions in dynamic networks and cannot flexibly guarantee the QoS requirements of different users. Recently, artificial intelligence (AI) algorithms have been applied to communication. In particular, Q-learning has been proposed to determine the optimal access choice through continuous interactive learning with the environment.

This paper proposes an intelligent access resource strategy based on reinforcement learning for STIN, aiming to select the best access satellite for multiple users covered by several fast-moving LEO satellites with limited onboard resources. The remainder of this paper is organized as follows. Section 2 illustrates the reference scenario and introduces the system model. Section 3 presents the proposed access algorithm based on reinforcement learning in detail. The proposed method is validated through simulation in Sect. 4. Finally, the conclusions of this paper are drawn in Sect. 5.

2 System Model

2.1 Scenario

We introduce a general formalism for the multi-satellite coverage scenario. The area covered by each satellite in the ground plane is shown in Fig. 1(a). User\(_1\) is simultaneously located under the signal coverage of satellites LEO\(_1 \) and LEO\(_2 \). The user sends an access request; after the satellite returns the result, the ground control center sends it to the user to complete the satellite access. Supposing a network composed of n users and m satellites, we have:

$$\begin{aligned} U=\{u_1, u_2, u_3,\dots , u_n \} \end{aligned}$$
(1)
$$\begin{aligned} S=\{s_1, s_2, s_3, \dots , s_m \} \end{aligned}$$
(2)

where U is the user set and S is the satellite set. We assume multiple satellites cover the same area simultaneously. For example, if two satellites cover user\(_i\), the set of satellites \( S_i \) that can serve the user is:

$$\begin{aligned} S_i=\{s_1,s_2 \} \end{aligned}$$
(3)

where \(i\in \{1,2,\dots ,n\}\).

Fig. 1. (a) Multi-satellite coverage model; (b) satellite coverage geometry of the earth.

Figure 1(a) shows the multi-satellite coverage model, and Fig. 1(b) shows the satellite coverage geometry of the earth. In Fig. 1(b), r represents the radius of the earth, \( \omega \) represents the elevation angle of the satellite, i.e., the angle between the user-to-satellite line and the horizontal plane, h is the height of the satellite above the ground, l represents the distance from the user to the satellite, \( \varphi \) represents the geocentric half-angle of the satellite service area on the ground, and \( \theta \) represents the angle at the satellite relative to the user.

2.2 Parameter Evaluation

We consider multiple parameters of the integrated satellite network, drawn from both the physical parameters of a single low-orbit satellite and the overall traffic-load capacity: the satellite elevation angle, the coverage time, and the number of available channels.

Figure 1(b) shows the satellite-to-ground geometry, in which \( \varphi \) is:

$$\begin{aligned} \varphi =\cos ^{-1}\left[ \frac{r}{r+h} \cdot \cos \omega \right] -\omega \end{aligned}$$
(4)

Then the radius of the coverage area is:

$$\begin{aligned} r^{\prime }=r \cdot \sin \varphi \end{aligned}$$
(5)

So the size of the coverage area is:

$$\begin{aligned} \textrm{s}=2 \pi r^{2} \cdot (1-\cos \varphi ) \end{aligned}$$
(6)

Assuming that satellites move around the earth in uniform circular motion, the orbital period T can be obtained as:

$$\begin{aligned} T=2 \pi \sqrt{\frac{(r+h)^{3}}{\mu }} \end{aligned}$$
(7)

where \( \mu \) = 398601.58 km\(^3/\)s\(^2\) is the Kepler constant. Therefore, the coverage time of the satellite to the ground is:

$$\begin{aligned} T_{s}=\frac{2 \varphi }{360} \cdot T \end{aligned}$$
(8)
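To make the parameter evaluation concrete, the following is a minimal Python sketch of Eqs. (4)-(8). The earth radius of 6371 km and the example altitude and elevation angle are assumed values for illustration, not figures taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0      # r: mean earth radius (assumed value)
KEPLER_CONSTANT = 398601.58   # mu in km^3/s^2, as given after Eq. (7)

def coverage_parameters(h_km: float, elevation_deg: float) -> dict:
    """Evaluate Eqs. (4)-(8) for one LEO satellite at altitude h and elevation omega."""
    r = EARTH_RADIUS_KM
    omega = math.radians(elevation_deg)

    # Eq. (4): geocentric half-angle of the coverage area
    phi = math.acos(r / (r + h_km) * math.cos(omega)) - omega
    # Eq. (5): radius of the coverage circle on the ground
    coverage_radius = r * math.sin(phi)
    # Eq. (6): area of the spherical cap covered by the satellite
    coverage_area = 2.0 * math.pi * r ** 2 * (1.0 - math.cos(phi))
    # Eq. (7): orbital period for uniform circular motion
    period = 2.0 * math.pi * math.sqrt((r + h_km) ** 3 / KEPLER_CONSTANT)
    # Eq. (8): coverage (visibility) time, with 2*phi expressed in degrees
    coverage_time = 2.0 * math.degrees(phi) / 360.0 * period

    return {
        "phi_deg": math.degrees(phi),
        "coverage_radius_km": coverage_radius,
        "coverage_area_km2": coverage_area,
        "period_s": period,
        "coverage_time_s": coverage_time,
    }

if __name__ == "__main__":
    # Example: a 780 km LEO satellite with a 10-degree minimum elevation angle (assumed values)
    print(coverage_parameters(h_km=780.0, elevation_deg=10.0))
```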

3 The Proposed Access Algorithm Based on Q-Learning

Fig. 2. Structure diagram of the multi-satellite access scheme based on Q-learning.

3.1 Q-Learning Algorithm

A Markov Decision Process (MDP) [5] is defined by a tuple (S, A, p, r) with explicit state transition properties. In the tuple, S is the finite set of states, A is the finite set of actions, p is the transition probability, and r is the immediate reward obtained when moving from state s to state \(s^{\prime }\) after executing action a. A "policy" \(\pi \) denotes a mapping from states to actions. The goal of an infinite-horizon MDP is to maximize the expected discounted total reward or the average reward:

$$\begin{aligned} {\max _{\pi }} \mathbb {E} \left[ { {\sum _{t=0}^{\infty }} \gamma ^{t} r_{t}(s_{t},\pi (s_{t})) }\right] \end{aligned}$$
(9)

where \(\gamma \in [0,1]\) is the discount factor, which determines the importance of future rewards relative to the current reward. We aim to find an optimal policy \(\pi ^{*}:\mathcal S\rightarrow \mathcal A\) and define the value function \(\mathcal {V}^{\pi }:~\mathcal {S} \rightarrow \mathbb {R}\), which represents the expected return obtained by following policy \(\pi \) from each state \(s\in \mathcal S \). The value function is:

$$\begin{aligned} \begin{aligned} \mathcal {V}^{\pi }(s)=&\mathbb {E}_{\pi } \left[ { \sum _{t=0}^{\infty } \gamma ^{t} r_{t}\left( {s_{t},a_{t}}\right) \Big |s_{0}=s }\right] \\ =&\mathbb {E}_{\pi }\left[ { r_{0}\left( {s_{0},a_{0}}\right) + \gamma \mathcal {V}^{\pi }\left( {s_{1}}\right) \Big |s_{0}=s }\right] \end{aligned} \end{aligned}$$
(10)

To find the optimal policy \(\pi ^{*}\), the optimal action at each state can be obtained through: \( \mathcal {V}^{*}(s) = \max \limits _{a_{t}} \{ \mathbb {E}_{\pi }[r_{t}(s_{t},a_{t}) + \gamma \mathcal {V}^{\pi }(s_{t+1})] \} \).

We define \(\mathcal {Q}^{*}(s,a) \triangleq r_{t}(s_{t},a_{t}) + \gamma \mathbb {E}_{\pi } [\mathcal {V}^{\pi }(s_{t+1})]\) as the optimal Q-function over all state-action pairs; the optimal value function can then be expressed as \(\mathcal {V}^{*}(s) = \max \limits _{a} \{ \mathcal {Q}^{*}(s,a) \}\). The optimal Q-function can be obtained for all state-action pairs through the following iterative process [6]:

$$\begin{aligned} \begin{aligned} \mathcal {Q}_{t{+}1}(s,a)=&\mathcal {Q}_{t}(s,a) \\&{+}\alpha _{t} \left[ { r_{t}(s,a) {+} \gamma \max _{a'} \mathcal {Q}_{t}(s',a') {-} \mathcal {Q}_{t}(s,a) }\right] \end{aligned} \end{aligned}$$
(11)

The core idea behind this update is the Temporal Difference (TD) error, i.e., the difference between the target value \(r_{t}(s,a) + \gamma \max _{a'} \mathcal {Q}_{t}(s',a')\) and the current Q-value estimate \(\mathcal {Q}_{t}(s,a)\).
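As an illustration of the update in Eq. (11), the following is a minimal sketch of one tabular Q-learning step. The dictionary-based Q-table is an implementation choice, and the default values of the learning rate and discount factor are illustrative (\(\alpha = 0.8\) matches one of the learning rates examined in Sect. 4, while \(\gamma\) is an assumed value).

```python
from collections import defaultdict

# Q-table mapping (state, action) pairs to values, zero-initialized.
Q = defaultdict(float)

def q_update(Q, s, a, reward, s_next, next_actions, alpha=0.8, gamma=0.9):
    """One tabular Q-learning step following Eq. (11)."""
    # TD target: immediate reward plus the discounted best Q-value of the next state.
    td_target = reward + gamma * max(Q[(s_next, a_next)] for a_next in next_actions)
    # TD error: difference between the target and the current estimate.
    td_error = td_target - Q[(s, a)]
    # Move the estimate toward the target at learning rate alpha.
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]
```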

3.2 Algorithm Structure

An overview of the proposed method is shown in Fig. 2. The State Evaluation Module collects the observed information of the STIN, and the Reinforcement Learning Module is the decision-making center that explores optimal access links by interacting with the environment. The procedure is summarized in Algorithm 1. We denote \(\mathcal {Q}_t^{*}(s,a)\) as the optimal Q-function at time t; \(s^{*}\) and \(a^{*}\) are the corresponding state and action.

Algorithm 1.
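As a concrete illustration of Algorithm 1 and the module interaction in Fig. 2, the following is a minimal training-loop sketch. The `env` object and its methods (`reset`, `actions`, `step`) are a hypothetical wrapper around the State Evaluation Module, not an interface defined in the paper; the state, action, and reward definitions it abstracts are given in Sect. 3.3.

```python
import random
from collections import defaultdict

def train_access_policy(env, episodes=500, alpha=0.8, gamma=0.9, epsilon=0.1):
    """Sketch of the Q-learning access-selection loop (cf. Fig. 2 and Algorithm 1)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False          # quantized (omega, t, c) observation
        while not done:
            candidates = env.actions(state)       # accessible satellites, the set A_i
            # Epsilon-greedy selection (Eq. (14)): explore with probability epsilon.
            if random.random() < epsilon:
                action = random.choice(candidates)
            else:
                action = max(candidates, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning update (Eq. (11)).
            best_next = 0.0 if done else max(Q[(next_state, a)]
                                             for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```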
Table 1. Parameters of the multi-satellite environment in STK.

3.3 Q-Learning Based Access Resource Strategy Design

The proposed scheme designs the satellite network state as the state set, the alternative satellites as the action set, and the comprehensive network performance as the reward function of the selection strategy. The details are as follows:

State Space. Three parameters, i.e., the satellite elevation angle \(\omega \), the coverage time t, and the number of available channels c, are considered as the state space of Q-learning. These parameters are selected based on signal strength, service continuity, and load balancing considerations. This paper considers the double-satellite coverage scenario, so the state space is:

$$\begin{aligned} S(\omega ,t,c)=\left\{ \left( \omega _{1}, t_{1}, c_{1}\right) ,\left( \omega _{2}, t_{2}, c_{2}\right) \right\} \end{aligned}$$
(12)

Action Space. The action space for the satellite access scenario is the set of candidate satellites to be selected for access. In Fig. 1(a), the set of satellites covering the user forms the action space:

$$\begin{aligned} A_{i}=\left\{ a_{1}, a_{2}\right\} \end{aligned}$$
(13)

This paper adopts the \(\epsilon \)-greedy strategy, in which \(\epsilon \) is the exploration probability. The system generates a random number \(\rho \in [0,1]\) and, according to \(\rho \), either takes the action with the maximum Q-value or a random action. The \(\epsilon \)-greedy strategy is as follows:

$$\begin{aligned} a_{\tau }=\left\{ \begin{array}{r} \arg \max \limits _{a} Q(s, a), \quad \epsilon \le \rho \le 1 \\ {random}(A), \quad 0 \le \rho < \epsilon \\ \end{array}\right. \end{aligned}$$
(14)
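A minimal sketch of the selection rule in Eq. (14) is given below, assuming the dictionary-style Q-table used earlier; the threshold comparison with \(\rho\) mirrors the two branches of the equation.

```python
import random

def epsilon_greedy(Q, state, candidates, epsilon):
    """Eq. (14): explore a random candidate satellite with probability epsilon,
    otherwise exploit the satellite with the highest Q-value."""
    rho = random.random()   # rho drawn uniformly from [0, 1]
    if rho < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda a: Q[(state, a)])
```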
Table 2. Influence of state parameters on performance indicators.
Table 3. Weights of the parameters in (15) for different services.

Reward Function. The observed QoS of the entire communication network, including packet loss, jitter, and delay, is designed as the reward. Considering the comprehensive impact of the selection strategy on network performance, we define a utility function:

$$\begin{aligned} r(s, a)=\alpha _{\omega } U_{\omega }(\omega ^{*})+\alpha _{t} U_{t}(t ^{*})+\alpha _{c} U_{c}(c^{*}) \end{aligned}$$
(15)

where \( U_\omega (\omega ^{*})\), \(U_t (t^{*}) \), and \( U_c (c^{*}) \) represent the benefit functions of the satellite elevation angle, the coverage time, and the available channels, respectively. \( \alpha _\omega \), \( \alpha _t \), and \( \alpha _c\) are the weights of the corresponding parameters, as shown in Table 3.

For the satellite elevation angle, the benefit function is:

$$\begin{aligned} U_{\omega }(\omega ^{*})=\sigma \left( \frac{\omega ^{*}-\omega _{\min }}{\omega _{\min }}\right) ^{2} \end{aligned}$$
(16)

where \(\omega ^{*}\) represents the current elevation angle, and \( \omega _{\min } \) is the minimum elevation angle at which the system can provide service. \( \sigma \in (0,1) \) is a normalization parameter selected according to factors such as the geographical environment. This formula reflects that the larger the satellite elevation angle, the better the signal quality.

For the utility function of satellite coverage time, the definition is given as follows:

$$\begin{aligned} U_{t}(t^{*})=\left\{ \begin{array}{r} \mu \left( \frac{t_{\max }}{t_{\max } - t ^{*}}\right) ^{2}, \quad t_{\max } \ne t ^{*} \\ 1, \quad t_{\max } = t ^{*} \\ \end{array}\right. \end{aligned}$$
(17)

where \( t^* \) represents the current coverage time, \( t_{\max } \) is the longest satellite coverage time, and \( \mu \) is a normalization parameter. This formula shows that the longer the coverage time, the better the communication quality for the user. For the channel load, we use the change in the number of available channels before and after the action is taken to measure whether the action is beneficial for load balancing. The function is defined as:

$$\begin{aligned} U_{c}(c^{*})=\left\{ \begin{array}{l} 0, \quad \varDelta c^{*}-\varDelta c<0 \\ 1, \quad \varDelta c^{*}-\varDelta c \ge 0 \end{array}\right. \end{aligned}$$
(18)

where \( c^* \) represents the current number of available channels. The difference in the number of available channels before and after the action selection measures whether the action benefits load balancing: if the difference is negative, the reward is 0; otherwise, the reward is 1.

Finally, it is necessary to design the weights \( \alpha _\omega \), \( \alpha _t \), and \( \alpha _c\). We comprehensively consider delay, jitter, and packet loss rate as the QoS measure. The effects of the three state parameters \(\omega , t\), and c on these performance indicators are listed in Table 2, which shows that packet loss is affected by both the satellite elevation and the available channels, whereas the delay is only affected by the elevation. The values of the weight factors are shown in Table 3.
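The reward of Eqs. (15)-(18) can be sketched as follows. The default values of \(\sigma\), \(\mu\), and the weights \(\alpha_\omega\), \(\alpha_t\), \(\alpha_c\) are illustrative placeholders only, since the paper takes the weights from Table 3 according to the service type.

```python
def elevation_utility(omega, omega_min, sigma=0.5):
    """Eq. (16): utility grows with the elevation angle above the minimum."""
    return sigma * ((omega - omega_min) / omega_min) ** 2

def coverage_time_utility(t, t_max, mu=0.5):
    """Eq. (17): utility grows as the coverage time approaches its maximum."""
    if t == t_max:
        return 1.0
    return mu * (t_max / (t_max - t)) ** 2

def channel_utility(delta_c_star, delta_c):
    """Eq. (18): reward 1 if the action does not hurt load balancing, 0 otherwise."""
    return 1.0 if (delta_c_star - delta_c) >= 0 else 0.0

def reward(omega, t, delta_c_star, delta_c, omega_min, t_max,
           alpha_omega=0.4, alpha_t=0.3, alpha_c=0.3):
    """Eq. (15): weighted sum of the three benefit functions (placeholder weights)."""
    return (alpha_omega * elevation_utility(omega, omega_min)
            + alpha_t * coverage_time_utility(t, t_max)
            + alpha_c * channel_utility(delta_c_star, delta_c))
```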

4 Simulation and Result Analysis

4.1 Environment and the Parameters

We evaluate the applicability of the algorithm in practical scenarios by simulating and verifying the proposed access algorithm. First, STK is used to build a low earth orbit (LEO) satellite scenario to obtain the satellite parameters shown in Table 1. After the parameters are obtained from the environment, they are quantized and loaded into the reinforcement learning module for training. The specific quantization ranges are shown in Table 4.

Table 4. Actual parameter ranges corresponding to the quantized values of the elevation angle, coverage time, and number of available channels used in \( U_\omega (\omega ^{*})\), \(U_t (t^{*}) \), and \( U_c (c^{*}) \).
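The quantization step can be sketched as below; the bin edges are hypothetical stand-ins, since the actual ranges are those of Table 4.

```python
import bisect

# Hypothetical bin edges; the actual quantization ranges come from Table 4.
ELEVATION_BINS_DEG = [10.0, 30.0, 60.0]        # elevation angle omega (degrees)
COVERAGE_TIME_BINS_S = [120.0, 300.0, 480.0]   # remaining coverage time t (seconds)
CHANNEL_BINS = [10, 20, 30]                    # number of available channels c

def quantize(value, bins):
    """Map a continuous measurement to a discrete level (0 .. len(bins))."""
    return bisect.bisect_right(bins, value)

def quantized_state(omega_deg, coverage_time_s, available_channels):
    """Build the discrete (omega, t, c) tuple used as a Q-table state."""
    return (quantize(omega_deg, ELEVATION_BINS_DEG),
            quantize(coverage_time_s, COVERAGE_TIME_BINS_S),
            quantize(available_channels, CHANNEL_BINS))
```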

The parameters of the Q-learning module are shown in Table 5.

Table 5. Parameters of the Q-learning model.

4.2 Result Analysis

For this algorithm, we set the number of training rounds to 500 and analyze the impact of the access selection algorithm on communication performance. The convergence of the Q-learning model in a real-time communication system is analyzed first. For comparison, we use a static scheme based on a single utility function, i.e., the weighted sum of the satellite elevation angle, coverage time, and available channels. In the following figures, CLWA denotes this comprehensive-weighting static access algorithm, and Q-Learning denotes the proposed method.

Fig. 3. Convergence process of Q-learning.

Convergence Analysis. Figure 3 shows that the Q-learning algorithm converges as training progresses, indicating that the agent can learn the optimal access strategy from the satellite elevation angle, coverage time, and number of available channels while exploring the STIN environment. The figure also shows the difference in convergence when the learning rate \(\alpha \) is 0.8 and 0.5, respectively. As demonstrated by the curves, when the learning rate is 0.8, the Q-value changes faster and stabilizes earlier. \(\alpha \) determines the learning speed: the larger the \(\alpha \), the faster the learning, provided the algorithm still converges.

Successful Access Rate Analysis. We measure this performance with the access probability, i.e., the ratio of the number of calls successfully connected to a satellite to the total number of calls. The access probability depends on the access algorithm and the busyness of the network. As shown in Fig. 4, when the number of call arrivals per unit time increases from 5 to 25, the access probability of both algorithms first remains close to 100\(\%\), then gradually decreases, and finally settles around 50\(\%\). This is because when the call arrival rate is relatively low, the network load is small, the call requests of all users can be satisfied, and the access probability is 1. As the call arrival rate increases, the network load gradually increases. As demonstrated in Fig. 4, the access probability of the CLWA algorithm starts to decrease earlier than that of the Q-Learning algorithm and remains lower, which indicates that the proposed algorithm can improve the access probability of users, thereby providing higher communication quality and user satisfaction.

Fig. 4. Probability of completed calls versus new call arrival rate.

Fig. 5. Satellite channel utilization versus call arrival rate.

Network Resource Utilization Analysis. We consider the impact of the algorithm on the utilization of the overall network resources. Channel utilization refers to the ratio of successfully utilized channels to the total channel capacity of the STIN. As shown in Fig. 5, as the number of calls increases, the channel utilization rises and then stabilizes at a value close to 1. When the call arrival rate is low, the network load is small, fewer channels are needed, and the channel utilization is low. The figure also shows that the channel utilization of the CLWA algorithm is lower than that of the Q-Learning algorithm and reaches its maximum later. This indicates that the proposed Q-Learning-based access algorithm can better allocate the channel resources of the STIN and improve channel utilization.

5 Summary

This paper proposes a multi-objective integrated satellite access algorithm based on Q-learning for the satellite-terrestrial integrated network (STIN), aiming to select the optimal access satellite for multiple users covered by multiple LEO satellites with limited channel resources. We consider multiple parameters, including the satellite elevation angle, the coverage time, and the available channels related to the traffic load. According to the QoS requirements, we formulate the access problem as a multi-objective optimization problem and adopt reinforcement learning to select the satellite. Finally, an LEO-based STIN is simulated in STK, and the proposed algorithm is implemented. Based on the results, we analyze the convergence of the algorithm and verify that it provides more efficient access selection in terms of user satisfaction and network resource utilization.