1 Introduction

In recent years, the rapid development of mobile communication technology has driven exponential growth in cellular trajectory data [1,2,3]. Spatio-temporal joint prediction is an important problem in the construction of intelligent communication systems. It aims to predict the next location of a cellular trajectory and the corresponding switch time simultaneously, which makes it a multitask joint prediction process. It is of great significance for deciding when user data should be scheduled to which base station. Besides, joint prediction benefits many applications, such as efficient resource management in mobile communications, location-aware advertising and navigation services [4].

Most existing trajectory prediction methods focus on a single task, i.e., location prediction or time prediction. Location prediction methods [5,6,7] use sequential models (e.g., HMMs and RNNs) to capture the spatial mobility regularities of trajectories and predict the next location. Most time prediction methods [8] utilize a temporal point process (TPP) [9] to model temporal sequences. However, these single-task methods cannot handle both tasks simultaneously, and thus cannot support the multitask requirements of trajectory prediction in most real scenarios. To achieve joint prediction, multitask learning based methods [10, 11] have been proposed to jointly utilize spatio-temporal signals and capture the mutual influence between the two tasks. Unfortunately, these multitask methods focus merely on modeling the trajectories themselves and ignore the importance of spatio-temporal contexts (e.g., traffic-related contexts); accordingly, they cannot provide accurate predictions.

Accurate spatio-temporal joint prediction for cellular trajectories remains challenging due to complicated spatio-temporal dependencies and various context information. First, trajectory movement is affected by the spatial distribution and temporal cycles of locations, because users tend to visit nearby base stations and follow periodic patterns. Second, aside from this geographical influence, traffic conditions also have a great influence on trajectory movement. Intuitively, traffic congestion may affect a trajectory’s movement speed and even the choice of the next location. Thus, traffic-related contextual information should be taken into account to achieve accurate predictions. Finally, trajectory prediction also needs to consider various background factors (e.g., departure time, weekdays and weather), since they lead to different mobility patterns. For example, trajectories during rush hours are more likely to encounter traffic congestion and thus require more attention to traffic context information, while ordinary daily trajectories may need to pay more attention to sequential information. However, most studies merely model the sequential information of trajectories without learning the spatio-temporal contexts, which results in inaccurate predictions.

Furthermore, we find that a trajectory usually has a consistent travel intention, and thus the state of a trajectory point is affected by the states of its follow-up points. Without information about the travel intention, the next location of a trajectory may be predicted inaccurately. However, once we know that the user’s destination is the airport, we can infer that the user will follow the airport road and predict the next location accordingly. Hence, the location movement of a cellular trajectory is influenced by its travel intention. Meanwhile, the travel intention of a trajectory can be predicted from a sequence of location switches. However, existing multitask methods ignore the signal of travel intention and thus cannot utilize it for spatio-temporal joint prediction.

To tackle the above issues, we propose a graph-contextualized multitask learning method called GCMT for spatio-temporal joint prediction, which adds travel intention prediction as an auxiliary task on top of spatio-temporal joint prediction. Specifically, we design a graph-based representation module, which constructs three relational graphs (i.e., location-location, location-region and location-time) and embeds each vertex into a shared low-dimensional space. In this way, it can capture the sequential effect, geographical influence and temporal cyclic effect. Then, we adopt a self-attention network to effectively model long and dense cellular trajectories. In addition, to capture recent traffic conditions, we adopt a traffic encoder to model the traffic dynamics in traffic flow data. The encoder consists of multiple ST-blocks, which combine a temporal gated CNN with spatial graph convolution to jointly learn spatial and temporal dependencies. Finally, considering the influence of various background factors (e.g., departure time, weekdays and weather), a context attention mechanism is designed to fuse sequential information and traffic information for a more comprehensive prediction. We summarize our contributions as follows:

  • We propose a graph-contextualized multitask learning method for spatio-temporal joint prediction, which integrates a representation module, trajectory encoder, traffic encoder, context modeling and task-specific decoder as a whole.

  • We adopt a graph-based representation module to jointly capture the sequential effect, geographical influence and temporal effect. Moreover, to learn traffic conditions, a spatio-temporal block is designed to model spatial and temporal dependencies in traffic flow data.

  • Extensive experiments on two real-world trajectory datasets show that our model outperforms all compared state-of-the-art methods.

This paper is an extension of our previous work [12], and it makes the following major improvements:

  • We design a bipartite graph embedding module to embed location-location, location-region, and location-time graphs into a shared low dimensional space, so as to jointly study geographical and temporal effects of locations.

  • We adopt a spatio-temporal block to learn traffic information in traffic flow data, which combines temporal gated CNN with spatial graph convolution as an integration.

  • We introduce a context attention mechanism to take account of various background factors, by which our model can adaptively assign reasonable weights to sequence and traffic information.

The remainder of this paper is organized as follows. We first introduce the related work of trajectory prediction and multitask learning in Section 2. Then, several definitions are given and the research problem is formulated in Section 3. In Section 4, we present our proposed method GCMT. Finally, we conduct extensive experiments in Section 5 and conclude the paper in Section 6.

2 Related work

2.1 Trajectory prediction

As an important part of smart city construction, trajectories have been studied for many years [13,14,15,16,17,18]. Research on trajectory prediction mainly focuses on the location prediction task [19,20,21] or the time prediction task [22, 23]. Most location prediction studies utilize learning techniques such as RNNs [6, 7] to model sequential information and make predictions in the spatial domain. Considering the importance of spatial and temporal interval information, some methods [6, 7, 24] extend the RNN framework. HST-LSTM [6] introduces an add operation to the existing gates of LSTM to merge spatio-temporal interval information. Flashback [24] performs flashbacks on past hidden states to consider historical records with similar contexts. DeepMove [5] adds an attention mechanism to GRU to learn multi-level periodicity.

In addition, a few studies [8, 23] focus on the time prediction task. RCR [8] utilizes visitors and potential visitors’ historical check-ins to extract features, and adopts censored regression for time predictions. A recurrent spatio-temporal point process model [23] is further proposed to utilize TPP to improve the performance. However, they require a pre-designated next location and cannot support the time prediction for trajectories with unknown next location.

In summary, these single-task methods neglect that trajectory prediction requires both location and time prediction in most real scenarios. A method designed for one task cannot directly support the other task, let alone joint prediction.

2.2 Multitask learning

Multitask learning (MTL) [25] aims to exploit meaningful information from related learning tasks to solve multiple tasks at the same time, and it has been successfully applied in many fields, such as computer vision [26]. Inspired by this success, a few multitask based methods [10, 11, 27] study the spatio-temporal joint prediction of events or POIs. RMTPP [10] combines an RNN with TPP to support joint prediction for events. ARNPP-GAT [27] uses graph attention networks to model users’ long-term preferences and combines them with an attention-based recurrent neural point process. IRNN [11] utilizes separate RNNs to model the time sequence and the event sequence. DeepJMT [28] adopts a hierarchical RNN to capture temporal patterns and mobility regularities, and extracts location semantics, user periodicity and social relationships to alleviate the data sparsity problem. IAMT [12] proposes an intention-aware multitask learning method, which introduces travel intention prediction as an auxiliary task on top of spatio-temporal joint prediction to provide long-term intentional information. However, these methods ignore the influence of spatio-temporal contexts on trajectory movement. Thus, we propose our method to model complicated spatio-temporal dependencies and various context information.

3 Preliminaries

Definition 1 (Trajectory)

Let \({\mathscr{L}}=\{l_{1},l_{2},\cdots ,l_{n_{1}}\}\) denote a set of locations. A spatio-temporal point p is a tuple (l,t), where the location \(l\in {\mathscr{L}}\) refers to a base station and the positive real number \(t\in \mathcal {R}^{+}\) represents the timestamp of switching to the location l. A trajectory T is a time-ordered sequence of spatio-temporal points \(T = \{p_{1},p_{2},\cdots ,p_{m}\}\). Besides, we denote the time interval between two consecutive spatio-temporal points as τ, i.e., \(\tau_{k} = t_{k}-t_{k-1}\).

Definition 2 (Location-Location Graph)

The Location-Location graph is denoted as \(\mathcal {G}_{ll}=({\mathscr{L}}\cup {\mathscr{L}}, \mathcal {E}_{ll})\), where \({\mathscr{L}}\) is a set of locations and \(\mathcal {E}_{ll}\) is a set of edges between locations. Given a time interval ΔT, for each spatio-temporal point pair \(\{(l_{i},t_{i}),(l_{j},t_{j})\}\) in a trajectory T, if \(0 < t_{j}-t_{i}\leq {\Delta } T\), there is an edge \(e_{ij}\) from \(l_{i}\) to \(l_{j}\). The weight \(w_{ij}\) of \(e_{ij}\) is defined as the number of times that \(l_{j}\) is visited after \(l_{i}\) within the time interval ΔT in the trajectory dataset \(\mathcal {T}\).
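As an illustration of Definition 2, the following minimal sketch counts the weighted edges of a Location-Location graph. It assumes the edge condition is 0 < t_j − t_i ≤ ΔT, that each trajectory is a time-ordered list of (location, timestamp) pairs, and that the function name `build_location_location_edges` and the toy data are hypothetical.

```python
from collections import Counter

def build_location_location_edges(trajectories, delta_t):
    """Count directed edges l_i -> l_j for point pairs with 0 < t_j - t_i <= delta_t."""
    weights = Counter()
    for traj in trajectories:  # traj: time-ordered list of (location, timestamp)
        for a, (l_i, t_i) in enumerate(traj):
            for l_j, t_j in traj[a + 1:]:
                gap = t_j - t_i
                if gap > delta_t:
                    break  # points are time-ordered, so later ones are even further away
                if gap > 0:
                    weights[(l_i, l_j)] += 1
    return weights

# Toy data (hypothetical): two short trajectories, timestamps in minutes.
toy = [[("A", 0.0), ("B", 4.0), ("C", 9.0)],
       [("A", 1.0), ("B", 3.0)]]
edges = build_location_location_edges(toy, delta_t=5.0)
```

Here the pair (A, C) in the first trajectory is skipped because its gap of 9 exceeds ΔT = 5, while (A, B) is observed in both trajectories and therefore receives weight 2.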

The Location-Location graph, as a general graph, captures sequential information and the spatial distribution of locations. It can be viewed as a bipartite graph when we place a location on one side and the other locations on the other side. To further capture geographical and temporal effects, we construct Location-Region and Location-Time bipartite graphs, as Figure 1 shows. Since vertices in graphs are discrete, continuous values (i.e., geographical areas and timestamps) are transformed into discrete ones. We adopt a grid-based partition method [29, 30] and divide the geographical space into N = ω × ω regions \(\mathcal {R}\). Besides, all timestamps are divided into a set of time slots \(\mathcal {S}\) based on the hours of a day.

Fig. 1 Illustration of three relational graphs

Definition 3 (Location-Region Graph)

The Location-Region graph, denoted as \(\mathcal {G}_{lr}=({\mathscr{L}}\cup \mathcal {R}, \mathcal {E}_{lr})\), is a bipartite graph. \({\mathscr{L}}\) is a set of locations and \(\mathcal {R}\) is a set of regions. \(\mathcal {E}_{lr}\) is a set of edges between locations and regions. If location \(l_{i}\) is in region \(r_{j}\), there is an edge \(e_{ij}\) between them and the weight \(w_{ij}\) is set to 1; otherwise, there is no edge.

Definition 4 (Location-Time Graph)

The Location-Time graph, denoted as \(\mathcal {G}_{ls}=({\mathscr{L}}\cup \mathcal {S}, \mathcal {E}_{ls})\), is a bipartite graph, where \(\mathcal {E}_{ls}\) is a set of weighted edges between locations and time slots. If location \(l_{i}\) is visited at time slot \(s_{j}\), there is an edge \(e_{ij}\) between them; otherwise, there is no edge. The weight \(w_{ij}\) is set to the frequency with which location \(l_{i}\) is visited during time slot \(s_{j}\).
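Definitions 3 and 4 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes planar station coordinates inside a bounding box, a row-major numbering of the ω × ω grid cells, timestamps given in seconds since midnight, and 24 hourly time slots. All names and the toy data are hypothetical.

```python
from collections import Counter

def region_of(x, y, bbox, omega):
    """Map a coordinate to a cell index of an omega x omega grid laid over bbox."""
    min_x, min_y, max_x, max_y = bbox
    col = min(int((x - min_x) / (max_x - min_x) * omega), omega - 1)
    row = min(int((y - min_y) / (max_y - min_y) * omega), omega - 1)
    return row * omega + col  # row-major cell numbering (an assumption)

def hour_slot(timestamp_s):
    """Map a timestamp in seconds since midnight to one of 24 hourly slots."""
    return int(timestamp_s // 3600) % 24

def build_bipartite_edges(station_xy, visits, bbox, omega):
    # Location-Region edges: weight 1 if the station lies in the region.
    lr_edges = {(loc, region_of(x, y, bbox, omega)): 1
                for loc, (x, y) in station_xy.items()}
    # Location-Time edges: weight = number of visits to the station in that slot.
    ls_edges = Counter((loc, hour_slot(t)) for loc, t in visits)
    return lr_edges, ls_edges

station_xy = {"A": (0.1, 0.1), "B": (0.9, 0.6)}
visits = [("A", 8 * 3600 + 10), ("A", 8 * 3600 + 900), ("B", 18 * 3600)]
lr_edges, ls_edges = build_bipartite_edges(
    station_xy, visits, bbox=(0.0, 0.0, 1.0, 1.0), omega=2)
```

With ω = 2, station A falls into cell 0 and station B into cell 3, and the two morning visits to A produce a Location-Time edge of weight 2 for slot 8.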

Definition 5 (Traffic Flow)

Traffic flow information is described by inflow and outflow. In a given time interval, inflow is the total number of traffic flows entering a region, while outflow is the total number of traffic flows leaving a region. At time t, we use \(X_{t} \in \mathcal {R}^{N\times 2}\) to denote the traffic flow of N regions, where \(X_{t}[i,0]\) and \(X_{t}[i,1]\) are the inflow and outflow of region i at time t.
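The inflow and outflow tensors of Definition 5 could be accumulated from region-level trajectories as in the sketch below. It assumes that a region crossing is attributed to the interval containing the arrival timestamp, which the text does not specify; the function name and the toy trips are hypothetical.

```python
def traffic_flow(region_trips, num_regions, interval_s, num_intervals):
    """Return X[k][i] = [inflow, outflow] of region i during interval k."""
    X = [[[0, 0] for _ in range(num_regions)] for _ in range(num_intervals)]
    for trip in region_trips:  # trip: time-ordered list of (region_id, timestamp_s)
        for (r_prev, _), (r_next, t_next) in zip(trip, trip[1:]):
            if r_prev == r_next:
                continue  # staying inside a region is not a flow
            k = int(t_next // interval_s)
            if 0 <= k < num_intervals:
                X[k][r_next][0] += 1  # inflow of the region being entered
                X[k][r_prev][1] += 1  # outflow of the region being left
    return X

trips = [[(0, 0), (1, 100), (1, 500), (2, 700)],
         [(2, 50), (1, 650)]]
X = traffic_flow(trips, num_regions=3, interval_s=600, num_intervals=2)
```

In this toy example, the first 10-minute interval contains a single crossing from region 0 to region 1, and the second contains one crossing in each direction between regions 1 and 2.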

Problem 1 (Spatio-temporal Joint Prediction)

Given a set of trajectories \(\mathcal {T}=\{T^{1},T^{2},\cdots ,T^{|\mathcal {T}|}\}\), where \(T^{i}=\{{p^{i}_{1}},{p^{i}_{2}},\cdots ,{p^{i}_{m}}\}\) is the i-th trajectory, spatio-temporal joint prediction aims to predict the next spatio-temporal point \(p^{i}_{m+1}\) of the trajectory \(T^{i}\), including the location identifier \(l^{i}_{m+1}\) and the corresponding timestamp \(t^{i}_{m+1}\) derived from the time interval \(\tau ^{i}_{m+1}\).

4 Methodology

4.1 Overview

As shown in Figure 2, we propose a graph-contextualized multitask learning model, which is made up of five parts, i.e., the representation module, trajectory encoder, traffic encoder, context modeling, and task-specific decoder. In the model, we design a multitask framework which adds travel intention prediction as an auxiliary task on top of spatio-temporal joint prediction. Specifically, we construct three relational graphs and use graph-based embedding methods to embed each vertex into a low-dimensional space. Then, we use a self-attention network to model trajectories and obtain sequential information. Meanwhile, the traffic encoder utilizes a spatio-temporal convolution network to capture traffic-related contexts. It is constructed from two ST-blocks, each of which combines a temporal gated CNN with spatial graph convolution. Considering the influence of other background factors, a context attention layer is used to adaptively assign different weights to sequential information and traffic information. Finally, the task-specific decoder makes the final predictions for each task, and the entire network is trained with multi-task losses.

Fig. 2 Architecture of GCMT. It consists of a representation module, trajectory encoder, traffic encoder, context modeling and task-specific decoder

4.2 Representation module

4.2.1 Bipartite graph embedding

Trajectory movement is affected by complicated geographical influences and temporal cycle dependencies among locations. For example, at noon, users usually head to restaurants for lunch and tend to visit nearby locations. Thus, we construct the location-location graph, location-region graph and location-time graph to jointly learn the spatio-temporal information of locations. We adopt a probabilistic model [31] to learn the embeddings of heterogeneous graph nodes.

Given a bipartite graph \(\mathcal {G}_{AB}=(\mathcal {V}_{A}\cup \mathcal {V}_{B}, \mathcal {E}_{AB})\), the probability of observing a bipartite edge \(e_{ij}\) between \(v_{i}\in \mathcal {V}_{A}\) and \(v_{j}\in \mathcal {V}_{B}\) is computed as:

$$ p(e_{ij}=1) = \frac{1}{1+\exp(-({\overrightarrow{v}}_{j}^{\mathrm{T}}\cdot\overrightarrow{v}_{i}))} $$
(1)

where \(\overrightarrow {v}_{i}\) and \(\overrightarrow {v}_{j}\) are the embedding vectors of \(v_{i}\) and \(v_{j}\) respectively. However, (1) is only applicable to a single bipartite edge. For a weighted bipartite graph \(\mathcal {G}_{AB}\), the negative log-likelihood objective is computed by:

$$ O_{AB} = -\sum\limits_{e_{ij}\in\mathcal{E}_{AB}}{w_{ij}\log{p(e_{ij}=1)}}-\sum\limits_{e_{ij}\in\overline{\mathcal{E}}_{AB}}{\gamma_{ij}\log(1-p(e_{ij}=1))} $$
(2)

where \(\overline {\mathcal {E}}_{AB}\) is a set of negative edges, \(\gamma_{ij}\) is the weight of a negative edge, and \(w_{ij}\) is the weight of a positive edge.

Directly optimizing (2) results in a high computational cost, since it requires evaluating a massive number of negative edges. Thus, we adopt a negative sampling method [32] to sample multiple negative edges for each positive edge, and the weights of the negative edges are assumed to be equal to the weight of their corresponding positive edge. We reformulate the objective function as:

$$ O_{AB} = -\sum\limits_{e_{ij}\in\mathcal{E}_{AB}}{w_{ij}\Big{[}\log{p(e_{ij}=1)}}+\sum\limits_{k=1}^{M}{\mathrm{E}_{v_{k} \sim P_{n}(v)}\log(1-p(e_{ik}=1))}\Big] $$
(3)

where M is the number of negative edges, \(P_{n}(v)\propto d_{v}^{3/4}\), and \(d_{v}\) is the degree of vertex v. We use the asynchronous stochastic gradient algorithm (ASGD) [33] to optimize (3). To collectively embed our three relational graphs into a shared low-dimensional space, we minimize the sum of all objective functions:

$$ O = O_{ll}+O_{lr}+O_{ls} $$
(4)

All edges in \(\mathcal {E}_{ll}\), \(\mathcal {E}_{lr}\) and \(\mathcal {E}_{ls}\) are first merged together. Considering that the weights of edges in different graphs are not comparable, we adopt the joint embedding training algorithm [31] to alternately sample from the three sets of edges to update the model. Hence, we can obtain the embedding matrices of locations \(M_{l}\in \mathcal {R}^{n_{1}\times d}\), regions \(M_{r}\in \mathcal {R}^{N\times d}\) and time slots \(M_{s}\in \mathcal {R}^{n_{2}\times d}\), where \(n_{2}\) is the number of time slots and d is the embedding dimension.
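To make the negative-sampling objective in (3) concrete, the sketch below evaluates the loss contribution of one positive edge and the noise distribution \(P_{n}(v)\propto d_{v}^{3/4}\). It is a minimal illustration that assumes the M negative vertices have already been drawn and replaces the expectation by the sampled terms; the function names are hypothetical, and the gradient update and alternating edge sampling are omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def edge_loss(v_i, v_j, negatives, w_ij):
    """Loss of one positive edge (i, j) with M sampled negative vertices, as in Eq. (3)."""
    loss = -math.log(sigmoid(dot(v_j, v_i)))             # positive edge term
    for v_k in negatives:                                 # M sampled negative edges
        loss -= math.log(1.0 - sigmoid(dot(v_k, v_i)))
    return w_ij * loss                                    # negatives share the weight w_ij

def noise_distribution(degrees):
    """Sampling distribution over vertices, proportional to degree^(3/4)."""
    powered = {v: d ** 0.75 for v, d in degrees.items()}
    total = sum(powered.values())
    return {v: p / total for v, p in powered.items()}

zero = [0.0, 0.0]
loss = edge_loss(zero, zero, negatives=[zero, zero], w_ij=3.0)
p_n = noise_distribution({"a": 16, "b": 1})
```

With all-zero embeddings every probability equals 0.5, so the loss reduces to w_ij · (1 + M) · log 2, which gives a quick sanity check for an implementation.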

4.2.2 Trajectory embedding

Both temporal and spatial sequences can provide meaningful knowledge for trajectory modeling, and meaningful mobility patterns may exist at different spatial granularities. Thus, we embed not only spatio-temporal multi-sequences, but also spatial multi-granularity. For a trajectory T, a sequence of base stations \(T_{l} = \{l_{1},l_{2},\cdots ,l_{m}\}\) and a sequence of timestamps \(T_{t} = \{t_{1},t_{2},\cdots ,t_{m}\}\) can be directly obtained. Besides, we derive a coarse-grained region sequence \(T_{r} = \{r_{1},r_{2},\cdots ,r_{m}\}\) from \(T_{l}\) for spatial multi-granularity modeling. To capture temporal information, a sequence of time intervals between consecutive points, i.e., \(T_{\tau } = \{\tau_{1},\tau_{2},\cdots ,\tau_{m}\}\), is further obtained. To obtain fixed-length sequences, we apply zero-padding at their end, e.g., \(T_{l} = \{l_{1},l_{2},\cdots ,l_{n}\}\), where n is the predefined maximum length. Then, we can retrieve the embedding of the location sequence \(E_{l}\in \mathcal {R}^{n\times d}\) through \(M_{l}\in \mathcal {R}^{n_{1}\times d}\), where \(E_{l,i} = M_{l,{T_{l,i}}}\), that is, by selecting the row vector of \(M_{l}\) specified by the base station’s identifier. The embeddings of the time interval sequence \(E_{\tau }\in \mathcal {R}^{n\times d}\) and the region sequence \(E_{r}\in \mathcal {R}^{n\times d}\) can be obtained in the same way.

4.3 Trajectory encoder

Considering that cellular trajectories have strong sequentiality and dense sampling points, a self-attention network [34] is adopted to capture the long-term dependencies of the whole sequence. Since multiple heads can jointly attend to information from different representation subspaces, we further use multi-head self-attention to encode trajectory sequences. The framework of the self-attention network (SAN) is shown in Figure 3(a). After obtaining the embedding \(E_{task}\) of each sequence, SAN is adopted to model the sequential information.

$$ S_{task} = SAN(E_{task}) $$
(5)

where task ∈{l,r,τ}. Besides, we take the representation of the last point of \(S_{task}\) as the task-specific sequential representation, i.e., \(S_{l,m}\), \(S_{r,m}\) and \(S_{\tau ,m}\).
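The core operation inside SAN is scaled dot-product self-attention. The sketch below is a deliberately simplified, single-head version with identity projections (queries, keys and values all equal the input embedding), so it omits the learned projection matrices, multiple heads, positional information and padding masks that a full SAN would use. The function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(E):
    """Single-head scaled dot-product self-attention with Q = K = V = E."""
    d = len(E[0])
    out = []
    for q in E:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in E]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, E)) for c in range(d)])
    return out

E_task = [[1.0, 0.0], [0.0, 1.0]]   # toy embedding of a length-2 sequence
S_task = self_attention(E_task)
S_last = S_task[-1]                 # last point's representation
```

Each output row is a convex combination of the input rows, weighted towards the positions most similar to the query, and the last row plays the role of the task-specific sequential representation.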

Fig. 3 The framework of SAN and ST-block

4.4 Traffic encoder

Trajectory movement is greatly affected by traffic conditions. For example, traffic congestion slows a trajectory down and thus affects the state of its next movement. Hence, we aim to model the complex spatio-temporal dependencies in traffic flows to take traffic-related contexts into account. We select traffic flows from the most recent P time intervals, i.e., \(\mathcal {X}=[X_{t_{m}-P+1}, X_{t_{m}-P+2}, \cdots , X_{t_{m}}]\). \(\mathcal {X}\) is fed into the traffic encoder, which is composed of two ST-blocks and a fully-connected layer. To extract spatial and temporal correlations, each ST-block is formed by a temporal gated CNN and spatial graph convolution, as Figure 3(b) shows.

4.4.1 Temporal gated CNN

Although RNNs are widely used in time series modeling, their performance is limited by non-parallel training procedures and time-consuming iterations. Inspired by the superiority of CNNs [35], we adopt a temporal gated CNN to capture the temporal dynamics of traffic flows. Specifically, it takes historical traffic \(H\in \mathcal {R}^{P\times N \times c}\) as input, where c is the number of input channels. A convolution operation with two convolution kernels \({{\varGamma }}_{1}, {{\varGamma }}_{2} \in \mathcal {R}^{t_{k} \times c \times 1 \times o}\) is adopted to integrate temporal information within \(t_{k}\) steps, where o is the number of output channels. The two convolution outputs are fed into a gating mechanism to capture long-term temporal memory. The output \(H_{t}\in \mathcal {R}^{(P-t_{k})\times N \times o}\) is computed by:

$$ H_{t} = ({{\varGamma}}_{1} \otimes H) \odot \sigma({{\varGamma}}_{2} \otimes H) $$
(6)

where ⊗ denotes the convolution operation, ⊙ denotes the element-wise product, and σ is the sigmoid function, which controls the propagation of temporal information [36].
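The gating in (6) can be illustrated on the traffic series of a single region with one input and one output channel. This sketch assumes an unpadded sliding window, which yields P − t_k + 1 output steps; the padding or indexing convention behind the output length stated in the text may differ. The function names and the toy kernels are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_temporal_conv(series, gamma1, gamma2):
    """Gated 1-D convolution along time: (gamma1 * H) ⊙ sigmoid(gamma2 * H)."""
    t_k = len(gamma1)
    out = []
    for s in range(len(series) - t_k + 1):   # unpadded ("valid") sliding window
        window = series[s:s + t_k]
        value = sum(g * x for g, x in zip(gamma1, window))  # content branch
        gate = sum(g * x for g, x in zip(gamma2, window))   # gate branch
        out.append(value * sigmoid(gate))
    return out

flow = [1.0, 2.0, 3.0, 4.0]                  # toy inflow over P = 4 intervals
H_t = gated_temporal_conv(flow, gamma1=[1.0, 1.0, 1.0], gamma2=[0.0, 0.0, 0.0])
```

With an all-zero gate kernel the sigmoid is 0.5 everywhere, so the output is simply half of the windowed sums, which makes the effect of the gate easy to verify.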

4.4.2 Spatial graph convolution

The current traffic condition of a region is influenced not only by its own recent history, but also by its surrounding regions due to their geographical connection. Hence, a graph convolution network is adopted to model the spatial dependencies between different regions.

Specifically, we treat different regions as the nodes in the graph, and the adjacency matrix A is computed according to the distances among regions.

$$ a_{ij}= \left\{\begin{array}{ll} \exp\left(-\frac{d_{ij}^{2}}{\sigma^{2}}\right), & \text{if}\ i\neq j\ \text{and}\ \exp\left(-\frac{d_{ij}^{2}}{\sigma^{2}}\right)\geq\epsilon,\\ 0, & \text{otherwise}. \end{array} \right. $$
(7)

where \(a_{ij}\) is the weight of the edge \(e_{ij}\), and \(d_{ij}\) is the distance between regions i and j. \(\sigma^{2}\) and \(\epsilon\) are thresholds that control the distribution and sparsity of the matrix A.
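The thresholded adjacency matrix can be sketched as below. The code assumes the usual Gaussian kernel with a negative exponent, so that edge weights decay with distance and the threshold ε removes weak, long-range edges. The function name and the toy distance matrix are hypothetical.

```python
import math

def build_adjacency(dist, sigma2, eps):
    """Thresholded Gaussian-kernel adjacency matrix over regions."""
    n = len(dist)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w = math.exp(-dist[i][j] ** 2 / sigma2)  # weight decays with distance
                if w >= eps:                              # keep only strong edges
                    A[i][j] = w
    return A

dist = [[0.0, 1.0, 3.0],
        [1.0, 0.0, 2.0],
        [3.0, 2.0, 0.0]]
A = build_adjacency(dist, sigma2=1.0, eps=0.1)
```

In this toy example only regions 0 and 1 remain connected, since exp(−1) ≈ 0.37 passes the threshold while exp(−4) and exp(−9) do not.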

To integrate the temporal information, we use \(H_{t}\) obtained by the temporal gated CNN to initialize the representation of each region. Inspired by [37], we define the graph convolution operation in the l-th layer as:

$$ H^{(l+1)}_{s}= \sigma(\hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}H^{(l)}_{s}W^{(l)}) $$
(8)

where \(H^{(l)}_{s}\) is the input of the l-th hidden layer, and \(H^{(0)}_{s}=H_{t}\). \(\hat {A}=A+I\), where I is an identity matrix. \(\hat {D}\) is the diagonal node degree matrix of \(\hat {A}\) with \(\hat {D}_{ii} ={\sum }_{j}{\hat {A}_{ij}}\). \(W^{(l)}\) is a matrix of trainable parameters, and σ(⋅) is a non-linear activation function (e.g., ReLU). Since a graph convolution layer aggregates information from 1-hop neighbors, we stack k layers to expand the receptive field and gain information from k-hop neighbors.
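A single graph convolution layer from (8) can be written out with plain lists as follows. This is a minimal sketch that assumes ReLU as the activation and treats the node features of one time step only; the helper names and the toy inputs are hypothetical.

```python
import math

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gcn_layer(A, H, W):
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]                     # diagonal of D-hat
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]                          # symmetric normalization
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]      # ReLU

A = [[0.0, 1.0], [1.0, 0.0]]   # two mutually connected regions
H0 = [[1.0], [3.0]]            # one feature per region
W0 = [[1.0]]                   # identity weight for illustration
H1 = gcn_layer(A, H0, W0)
```

For two connected regions every entry of the normalized matrix equals 0.5, so each region ends up with the average of the two input features. Stacking k such calls corresponds to aggregating from k-hop neighbors.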

Finally, we further use a fully-connected layer to obtain the current traffic condition \(H_{t_{m}}\) and extract the information of current region rm to represent traffic-related contexts, i.e., \(H_{c}=H_{t_{m},r_{m}}\).

4.5 Context modeling

So far, we have obtained trajectory sequence information and traffic-related information. Intuitively, trajectories under different backgrounds have different mobility patterns, and the two kinds of information may matter to different degrees. For example, traffic contexts may have a greater impact on trajectories during traffic congestion, while sequence information may have a greater effect on ordinary daily trajectories. Hence, a context attention mechanism is proposed to model the influence of background factors.

After the trajectory and traffic encoders, we obtain two vectors \(S_{task,m}\) and \(H_{c}\) for each task, and we project them to new query vectors \(p_{s}\) and \(p_{c}\) respectively. Besides, we concatenate all the background factors together as the context embedding \(e_{c}\) of the trajectory T. Then, the attention coefficients are measured by computing the similarity between each query vector and \(e_{c}\) as follows:

$$ a_{i} = softmax(p_{i}e_{c})= \frac{\exp(p_{i}e_{c})}{{\sum}_{j\in\{s,c\}}{\exp(p_{j}e_{c})}} $$
(9)

where i ∈{s,c}. Then we can get a new vector integrating trajectory sequence information and traffic context information based on the attention coefficients.

$$ Z_{task} = \sum\limits_{i}{a_{i}p_{i}} $$
(10)

where task ∈{l,r,τ}. Thus, we can effectively fuse trajectory sequential information and traffic information based on the trajectory’s background factors.
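Equations (9) and (10) amount to a two-way softmax followed by a weighted sum. The sketch below assumes that the query vectors and the context embedding share the same dimension and that similarity is a plain dot product; the projection layers that produce \(p_{s}\) and \(p_{c}\) are omitted, and the function name is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_attention(p_s, p_c, e_c):
    """Fuse sequence query p_s and traffic query p_c using context embedding e_c."""
    scores = [dot(p_s, e_c), dot(p_c, e_c)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    a_s, a_c = (e / sum(exps) for e in exps)          # Eq. (9): softmax over {s, c}
    z = [a_s * x + a_c * y for x, y in zip(p_s, p_c)]  # Eq. (10): weighted sum
    return z, (a_s, a_c)

p_s, p_c = [1.0, 0.0], [0.0, 1.0]
z_neutral, w_neutral = context_attention(p_s, p_c, e_c=[0.0, 0.0])
z_seq, w_seq = context_attention(p_s, p_c, e_c=[2.0, 0.0])
```

A neutral context gives both sources equal weight, whereas a context aligned with the sequence query shifts the weight towards the sequential representation.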

4.6 Task-specific decoder

4.6.1 Fusion layer

After obtaining the sequential information and intentional information from trajectories, an effective fusion method is required to aggregate them. To account for the importance of each kind of information for trajectory prediction, we design a gating mechanism that lets the fused representation retain both kinds of knowledge in different proportions and focus more on the features that are more informative for the current movement. The gating mechanism can be written as follows:

$$ \begin{array}{@{}rcl@{}} G_{task,1} &=& sigmoid(W_{task,1}Z_{task}+W_{task,r,1}Z_{r}+b_{task,1})\\ G_{task,2} &=& sigmoid(W_{task,2}Z_{task}+W_{task,r,2}Z_{r}+b_{task,2})\\ G_{task} &=& G_{task,1} \odot Z_{task} + G_{task,2} \odot Z_{r} \end{array} $$
(11)

where \(W_{task,1}\), \(W_{task,r,1}\), \(b_{task,1}\), \(W_{task,2}\), \(W_{task,r,2}\) and \(b_{task,2}\) are learnable parameters, and task ∈{l,τ}.

4.6.2 Prediction layer

Next location predictor

It predicts the next location \(l_{m+1}\) based on the final representation \(G_{l}\), and the probability of \(l_{m+1}\) is calculated as follows:

$$ P_{l} = softmax(G_{l}W_{l}+b_{l}) $$
(12)

where \(W_{l}\) and \(b_{l}\) are learnable parameters.

Intention predictor

Similar to the next location predictor, the probability of the intention I is calculated based on the intentional representation \(Z_{r}\):

$$ P_{r} = softmax(Z_{r}W_{r}+b_{r}) $$
(13)

Switch time predictor

We adopt TPP to model time sequences and make time predictions. Inspired by previous work [10], we use the output of the deep neural network to calculate the density function \(f^{*}(t)\), which gives the probability density that the next location switch occurs at time t given T.

$$ \begin{array}{@{}rcl@{}} f^{*}(t) &=& \exp\{{W_{t}^{T}}G_{\tau}+W_{\tau}(t-t_{m})+\lambda_{0}+\frac{1}{W_{\tau}}\exp({W_{t}^{T}}G_{\tau}+\lambda_{0})\\ &&-\frac{1}{W_{\tau}}\exp({W_{t}^{T}}G_{\tau}+W_{\tau}(t-t_{m})+\lambda_{0})\} \end{array} $$
(14)

where \(W_{t}\), \(W_{\tau }\) and \(\lambda_{0}\) are trainable parameters. The next switch time is computed as \(t_{m+1} = {\int \limits }^{\infty }_{t_{m}}tf^{*}(t)\,dt\), which can be calculated with numerical integration [38].
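The density in (14) and the expectation of the next switch time can be checked numerically with a short sketch. It assumes a scalar `past` standing for \({W_{t}^{T}}G_{\tau }\), a positive \(W_{\tau }\), a simple trapezoidal rule, and a finite horizon that truncates the upper integration limit. All names and parameter values are hypothetical and are not the integration scheme of [38].

```python
import math

def density(t, t_m, past, w_tau, lam0):
    """f*(t) from Eq. (14); `past` stands for W_t^T G_tau, w_tau must be non-zero."""
    log_intensity = past + w_tau * (t - t_m) + lam0
    return math.exp(log_intensity
                    + math.exp(past + lam0) / w_tau
                    - math.exp(log_intensity) / w_tau)

def trapezoid(fn, a, b, steps):
    h = (b - a) / steps
    total = 0.5 * (fn(a) + fn(b))
    for k in range(1, steps):
        total += fn(a + k * h)
    return total * h

def expected_switch_time(t_m, past, w_tau, lam0, horizon=20.0, steps=20000):
    """Approximate the integral of t * f*(t) over [t_m, t_m + horizon]."""
    return trapezoid(lambda t: t * density(t, t_m, past, w_tau, lam0),
                     t_m, t_m + horizon, steps)

t_m = 10.0
mass = trapezoid(lambda t: density(t, t_m, 0.0, 1.0, 0.0), t_m, t_m + 20.0, 20000)
t_next = expected_switch_time(t_m, past=0.0, w_tau=1.0, lam0=0.0)
```

For these toy parameters the density integrates to approximately 1 over the truncated range, and the predicted switch time lies roughly 0.6 time units after \(t_{m}\).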

4.6.3 Loss layer

Location loss

We apply the multi-class cross-entropy loss as our location loss function, which is calculated by:

$$ \mathcal{L}_{l} = -\hat{l}_{m+1}\log{P_{l}} $$
(15)

where \(\hat {l}_{m+1}\) is the one-hot representation of the ground truth and \(P_{l}\) is the predicted probability distribution over base stations.

Time loss

We define the loss function for the next switch time prediction task based on the definition of TPP as follows:

$$ \mathcal{L}_{\tau} = -\log{f^{*}(t_{m+1})} $$
(16)

where \(f^{*}(t_{m+1})\) is the density function, which is calculated by (14).

Intention loss

A traditional multi-class loss function treats the categories as independent of each other, and thus cannot capture the complicated spatial associations among different regions, such as distance and direction consistency. Thus, a distribution-aware loss function is designed to effectively capture the intention information.

$$ \mathcal{L}_{r} = \frac{{\sum}_{i\in {topk\_r}}{D_{r_{j},i}\cdot{P_{r,i}}}}{{\sum}_{i\in {topk\_r}}{P_{r,i}}} $$
(17)

where \(topk\_r\) is the set of the k regions with the highest probabilities in \(P_{r}\); \(D_{r_{j},i}\) is the distance between the ground-truth region \(r_{j}\) and region \(r_{i}\); \(j = m + 0.5m\), where m is the length of the current trajectory; and \(P_{r,i}\) is the predicted probability that the future goal lies in the i-th region.
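The distribution-aware loss in (17) is a probability-weighted average distance over the top-k predicted regions. The sketch below assumes the distances from every region to the ground-truth region are precomputed; the function name and the toy values are hypothetical.

```python
def intention_loss(probs, dist_to_truth, k):
    """Probability-weighted mean distance to the true region over the top-k regions."""
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    numerator = sum(dist_to_truth[i] * probs[i] for i in topk)
    denominator = sum(probs[i] for i in topk)
    return numerator / denominator

P_r = [0.5, 0.3, 0.2]        # predicted probabilities over three regions
D = [0.0, 2.0, 5.0]          # distances to the ground-truth region (region 0)
loss_r = intention_loss(P_r, D, k=2)
```

Here the top-2 regions are 0 and 1, giving (0 · 0.5 + 2 · 0.3) / 0.8 = 0.75. Probability mass placed on regions close to the true one is penalized less than mass placed far away, which is what distinguishes this loss from plain cross entropy.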

Multi-task loss

The entire network is trained by minimizing the weighted loss sum of location prediction, time prediction and intention prediction.

$$ \mathcal{L}({{\varTheta}}) = \beta_{\tau}\mathcal{L}_{\tau} + \beta_{r}\mathcal{L}_{r} + (1-\beta_{\tau}-\beta_{r})\mathcal{L}_{l} $$
(18)

where Θ denotes all learnable parameters; \(\beta_{\tau }\) and \(\beta_{r}\) are hyper-parameters that tune the relative influence of \(\mathcal {L}_{\tau }\) and \(\mathcal {L}_{r}\).

5 Experiments

5.1 Datasets

Table 1 shows the statistics of two real cellular trajectory datasets, which were collected on June 29, 2019 in Chengdu and on June 9, 2019 in Xiamen, respectively. A base station is viewed as a location that provides signals to the area around it. If a user enters the area, the base station identifies the user and records the corresponding time. A trajectory is a time-ordered spatio-temporal point sequence in a taxi’s trip order. We remove trajectories with fewer than 5 points and take half of each trajectory’s length as its input length.

Table 1 Statistics of two datasets

5.2 Baselines

To evaluate the effectiveness of GCMT for two main tasks, we respectively compare it with single-task methods for location prediction (STRNN, DeepMove, HST-LSTM, and Flashback), single-task methods for time prediction (Avg, THP) and multi-task methods (RMTPP, IRNN, ARNPP-GAT, IAMT).

  • STRNN [7]. It uses distance-specific and time-specific matrices to extend standard RNN framework for location prediction.

  • DeepMove [5]. It combines a historical attention mechanism with GRU to predict the next location over lengthy and sparse trajectories.

  • HST-LSTM [6]. It adopts an add operation on three existing gates of LSTM to consider geographic distance and time interval.

  • Flashback [24]. It does flashbacks on past hidden states of RNN to consider historical points with similar context for next location prediction.

  • Avg. It takes the average value of historical spatio-temporal points’ time interval as the predicted result.

  • THP [39]. It introduces a transformer-based architecture into Hawkes process to make time predictions for event sequence.

  • RMTPP [10]. It utilizes RNN to construct intensity function of recurrent point process, which is used for events’ spatio-temporal joint prediction.

  • IRNN [11]. It adopts two unshared RNNs to respectively model the event and time sequence, and combines with TPP for the joint prediction.

  • ARNPP-GAT [27]. It utilizes GAT to model users’ long-term preferences and adopts an attention-based recurrent point process for next check-in inference. GAT is removed due to the lack of a user social graph in our datasets.

  • IAMT [12]. It adds travel intention prediction as an auxiliary task to provide long-term mobility information for spatio-temporal joint prediction.

5.3 Parameter setup and metrics

All methods are implemented with PyTorch. We randomly select 70% of each dataset for training, and use the remaining 10% and 20% for validation and testing. For parameter setup, M is set to 5 and the kernel size of the CNN is 3. We use the traffic flow data of the previous 6 intervals (60 minutes). Other settings are the same as in IAMT [12]. To evaluate the performance of location prediction, we use four widely used metrics: Accuracy (ACC), Mean Reciprocal Rank (MRR), Recall, and macro-F1. Besides, Mean Absolute Error (MAE) and Root Mean-Squared Error (RMSE) are used to evaluate models for time prediction. To be fair, each method is run three times and the average value is taken as the final result.
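For reference, four of the metrics above can be computed as in the following minimal sketch. It assumes that location predictions are given as ranked candidate lists and that time predictions are plain numbers; Recall and macro-F1 are omitted for brevity, and all names and toy values are hypothetical.

```python
import math

def accuracy(ranked_lists, truths):
    """Fraction of samples whose top-ranked location is the true one."""
    return sum(r[0] == y for r, y in zip(ranked_lists, truths)) / len(truths)

def mean_reciprocal_rank(ranked_lists, truths):
    """Average of 1 / rank of the true location (0 if it is not ranked)."""
    total = 0.0
    for ranked, y in zip(ranked_lists, truths):
        if y in ranked:
            total += 1.0 / (ranked.index(y) + 1)
    return total / len(truths)

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

ranked = [["a", "b", "c"], ["b", "a", "c"]]
truth = ["a", "a"]
acc, mrr = accuracy(ranked, truth), mean_reciprocal_rank(ranked, truth)
time_mae, time_rmse = mae([1.0, 3.0], [2.0, 5.0]), rmse([1.0, 3.0], [2.0, 5.0])
```

In the toy example the true location is ranked first once and second once, so ACC is 0.5 and MRR is 0.75.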

5.4 Performance comparison

5.4.1 Next location prediction

From the results in Table 2, we can see that our model outperforms all baselines. Specifically, traditional single-task methods (STRNN, DeepMove, HST-LSTM, and Flashback) perform worse than our model GCMT. The reason is that these methods neglect the mutual influence of spatio-temporal signals in a trajectory. In addition, GCMT performs better than the multi-task methods (RMTPP, IRNN, ARNPP-GAT, and IAMT), because they ignore complex spatio-temporal contexts. The superiority of our model shows that trajectory movement is also influenced by complicated spatio-temporal dependencies and various context information. Besides, the Hangzhou dataset has more base stations and fewer records than Xiamen, so it is sparser and it is more difficult to train a good model of trajectory movement on it, which results in worse performance on Hangzhou.

Table 2 Performance comparison results for next location prediction

5.4.2 Next switch time prediction

Table 3 shows that our model achieves the best performance. In detail, GCMT improves on the single-task methods (Avg and THP), because it jointly utilizes spatio-temporal signals for collaborative trajectory modeling. Besides, although the multi-task baselines (RMTPP, IRNN, ARNPP-GAT and IAMT) utilize both spatial and temporal signals, they ignore the fact that the travel speed of a trajectory is affected by the spatio-temporal dependencies of locations and by traffic-related contextual information, and thus perform worse than GCMT.

Table 3 Performance comparison results for next switch time prediction

5.5 Ablation study

We compare GCMT with variants to study the usefulness of each component.

  • GCMT-E removes the graph embedding, which uses a lookup layer to transform one-hot vectors of sequence into dense vector representations.

  • GCMT-T removes the traffic encoder, which ignores the traffic-related contextual information.

  • GCMT-C removes context attention mechanism, which sums trajectory sequence representation and traffic context representation.

As shown in Tables 2 and 3, all variants perform better than the baselines, which confirms the advantage of our method. In detail, GCMT-E performs worse than GCMT, indicating that graph-based embedding can effectively extract geographical and temporal cyclic effects. The comparison between GCMT-T and GCMT shows that traffic contexts are important for accurate spatio-temporal joint prediction. Besides, GCMT outperforms GCMT-C, which shows that the context attention mechanism can fuse sequence information and traffic information according to the various background factors of a trajectory. GCMT achieves the best result by utilizing complicated spatio-temporal contexts.

5.6 Effect of different grid granularity

Figure 4 shows the performance of GCMT on the next location prediction task with the grid granularity ω chosen from {50,100,150,200,250}. On the Hangzhou dataset, when ω is less than 100, performance improves as ω increases, because with a small ω each region is so large that the direction reflected by the future region provides little useful information. Besides, when ω is greater than 100 on Hangzhou and 50 on Xiamen, increasing ω may result in a slight performance degradation, because the future region becomes too small to be accurately predicted and the predicted direction may deviate considerably. Finally, the grid granularity is set to 100 on Hangzhou and 50 on Xiamen.

Fig. 4 Effects of different grid granularity

6 Conclusion

In this paper, we study spatio-temporal joint prediction for cellular trajectories and propose a graph-contextualized multitask learning method that can learn complicated spatio-temporal context information. Specifically, we introduce a graph embedding module that utilizes geographical influence and the temporal cyclic effect to obtain a meaningful representation for each location. To capture traffic-related contextual information, we combine a temporal gated CNN and spatial graph convolution to learn the dynamics of traffic flows. In addition, a context attention mechanism is designed to fuse sequence information and traffic information according to the background factors of a trajectory. Finally, extensive experimental results on two real trajectory datasets verify that GCMT can achieve accurate spatio-temporal joint prediction.