1 Introduction

Recent advances in computer vision have greatly improved the efficiency of city-level traffic control by enabling accurate prediction and analysis of high traffic volumes. A vehicle tracking application is a crucial element of intelligent traffic management systems: it combines a vehicle's spatial, temporal, and visual data to generate its trajectory, and it can be used to monitor the paths of cars within an urban area and to estimate their velocity and travel time to improve traffic efficiency. Multi-Target Multi-Camera Tracking (MTMCT) is a significant application in this domain. The objective of MTMCT is to generate a comprehensive global trajectory of a vehicle by combining its trajectories extracted from cameras positioned at various locations across a large area. Tracking-by-detection is a fundamental paradigm in MTMCT and consists of three components: (i) object detection, (ii) multi-object tracking in single cameras (MOT), and (iii) trajectory clustering. Object detection identifies objects as bounding boxes within video frames. Multiple Object Tracking (MOT) then follows each object's motion in a single camera by matching bounding boxes across consecutive frames and generating tracklets. Finally, trajectory clustering produces a global map of object activity by combining tracklets from multiple cameras. In online MTMCT, the tracker links objects using only past frames, whereas offline MTMCT may also consider future frames. The tracking-by-detection paradigm has gained considerable recognition and has demonstrated encouraging results (Yao et al., 2022; Nguyen et al., 2023).

However, despite its appealing design and impressive performance, applying the tracking-by-detection schema in real-world scenarios presents several challenges: (1) Non-generative: a new (spatio-temporal) matching rule must be developed for every new camera scenario; (2) Data limitation: to the best of the authors' knowledge, public MTMCT data is currently scarce, with only the CityFlow dataset (Tang et al., 2019) available; (3) High cost of manual labeling: the fine-tuning procedure requires extensive time and effort to label datasets manually. In response to these challenges, we propose a generative, end-to-end, transformer-based MTMCT model called LaMMOn. Because it is an end-to-end model, LaMMOn is easily applied to different camera scenarios without the need to build new matching rules. Furthermore, LaMMOn addresses data limitation and the high cost of manual labeling by synthesizing object embeddings from text and using these synthesized embeddings to fine-tune on new datasets. The LaMMOn architecture is visualized in Fig. 1.

In general, LaMMOn contains three modules: (1) the Language Model Detection module (LMD) performs the (i) object detection task and generates object embeddings; (2) the Language and Graph Model Association module (LGMA) handles (ii) multi-object tracking in a single camera (MOT) and (iii) trajectory clustering simultaneously; and (3) the Text-to-embedding module (T2E) synthesizes object embeddings from texts that identify object features such as car type, car color, and location. Fig. 1 presents an overview of the LaMMOn architecture, and Sect. 3 details the methodology.

Fig. 1 Overview of the LaMMOn architecture

The primary contributions of our paper are highlighted below:

  • We propose a generative end-to-end MTMCT method called LaMMOn that effectively adapts to diverse traffic video datasets without the need for manual rule-based matching or manual labeling.

  • We propose a T2E module that can synthesize object embeddings from text, alleviating the data limitation of the MTMCT task.

  • We propose the LGMA module to address (ii) multi-object tracking in single cameras (MOT) and (iii) trajectory clustering simultaneously. LGMA integrates both language and graph models to enhance association performance and introduces a novel perspective on the object association problem.

  • Our online MTMCT application achieves competitive results on multiple datasets: CityFlow (Tang et al., 2019) (HOTA 76.46%, IDF1 78.83%), I24 (Gloudemans et al., 2024) (HOTA 25.7%), and TrackCUIP (Harris et al., 2019) (HOTA 80.94%, IDF1 81.83%), with an FPS (from 12.20 to 13.37) acceptable for an online application.

The rest of the paper is structured as follows: Sect. 2 reviews related work; Sect. 3 presents our proposed architecture and methodology; Sect. 4 presents our datasets, evaluation metrics, experimental setups, results, and ablation studies; Sect. 5 concludes with future work.

2 Related work

2.1 Multi-object tracking (MOT)

Multi-Object Tracking (MOT) is the process of associating objects observed in video frames from a single camera and creating tracklets for the detected objects. Multiple efficient techniques have been proposed for MOT. For example, Tracktor (Bergmann et al., 2019) feeds existing tracks to a detector as proposals and directly propagates the tracking IDs. The association strategy of CenterTrack (Zhou et al., 2020) relies on comparing the predicted positions of objects with their matching detections in established tracks. The TransCenter framework (Xu et al., 2021) extends CenterTrack by incorporating Deformable DETR (Zhu et al., 2020). The recent work of Hassan et al. (2023) combines a Siamese network with DeepSORT to extract features for tracking.

2.2 Multi-target multi-camera tracking (MTMCT)

Multi-Target Multi-Camera Tracking (MTMCT) aims to build a global tracklet that follows an object's movement across multiple cameras. To our knowledge, all previous techniques have treated MTMCT as a stage that follows MOT: it is typically cast as a clustering problem whose inputs are the trajectories obtained from MOT. In a prior study (Yao et al., 2022), the clustering step incorporates spatial-temporal filtering and traffic rules; these constraints significantly narrow the search space and substantially improve vehicle re-identification accuracy. Using the same camera distribution for both test and training data, Ullah and Alaya Cheikh (2018) learn the transition-time distribution for each pair of adjacent cameras without manual adjustment. Tesfaye et al. (2017) present various methods for MTMCT with non-overlapping views.

Unlike previous MTMCT approaches, LaMMOn is an end-to-end model that performs detection and association for multiple cameras simultaneously. The methodology is described in detail in Sect. 3.

2.3 Transformer in tracking

Transformer-based models have recently been widely adopted in domains that require image processing (Sun et al., 2020; Ghaffar Nia et al., 2023; Chohan et al., 2023). For the MTMCT problem, many studies have employed transformer models to improve performance. Trackformer (Meinhardt et al., 2022) enhances the DETR model by adding object queries derived from existing tracks and propagating track IDs, similar to Tracktor (Bergmann et al., 2019). TransTrack (Sun et al., 2020) uses past track information as queries and associates objects via updated bounding box locations. Furthermore, MO3TR (Zhu et al., 2022) incorporates a temporal attention module to update the state of each track within a specific time frame and then uses the updated track features as queries in DETR. The underlying idea of these works is to use the object query mechanism of DETR (Carion et al., 2020) to progressively extend existing tracks frame by frame. Our use of transformers differs. Our transformer-based detector, LMD, uses queries to identify many objects as bounding boxes simultaneously. We then employ an additional transformer-based module, LGMA, to group the detected boxes into global trajectories.

2.4 Graph neural network in tracking

GNNs were originally introduced to apply neural networks to graph-structured data (Gori et al., 2005). The core idea is to design a graph of interconnected nodes and edges and to update node/edge features based on these interconnections. In recent years, various GNNs (e.g., GraphConv, GCN, GAT, GGSNN) have been proposed, each with a distinct feature aggregation rule, and they have proven effective on a variety of transportation tasks (Weng et al., 2020; Kumarasamy et al., 2023; Li et al., 2020; Khaleghian et al., 2023; Duan et al., 2019). Specifically, in GNN3DMOT (Weng et al., 2020), the authors design an unweighted graph in which each node represents an object feature at a particular frame, and each edge between nodes at different frames represents a match between detections. The graph-based method of Duan et al. (2019) establishes a global graph over multiple tracklets from different cameras and optimizes it for an MTMCT solution. Recently, Nguyen et al. (2023, 2023a) built tracklet features as graph structures and used graph similarity to cluster single-camera tracklets.

3 Methodology

LaMMOn is partitioned into three modules: (1) the Language Model Detection module (LMD) detects objects and generates object embeddings; (2) the Language and Graph Model Association module (LGMA) handles multi-object tracking in single cameras and trajectory clustering at the same time; (3) the Text-to-embedding module (T2E) synthesizes object embeddings from texts that identify object features such as car type, car color, and location. The general architecture is depicted in Fig. 1. First, the video frame input is combined with the Positional ID embedding and the Camera ID embedding. The LMD then accepts the concatenated embedding as input and generates proposal object embeddings for the objects in each frame. The LGMA module uses these object embeddings, together with information from the memory buffer and the filter module, to produce the global tracklets. In addition, we use the synthesizer (T2E) module to enrich the proposal embeddings with synthesized embeddings, addressing the challenge of limited data and ultimately improving the final result. In the following subsections, we introduce the preliminaries and examine each module in detail.

3.1 Preliminaries

In this section, we will formally define object detection, tracking, tracklet, and tracking schema.

Object detection. Consider an image denoted by I. Object detection aims to recognize and precisely localize all objects of interest in image I. An object detection module receives image I as input and generates a collection of objects \(\{o_i\}\) with their respective locations \(\{b_i\}\), \(b_i \in \mathbb {R}^4\), as output. If the objects belong to several classes, the detector also generates a classification score \(s_i \in \mathbb {R}^C\) over a predetermined set of classes C. Our model focuses only on cars, so the classes C represent vehicle types such as SUV, sedan, or truck. In addition, our model generates a classification score for the color of the car and the camera ID. To summarize, our object detector produces the following outputs: \(\{ c,b_1, b_2, b_3, b_4, s_1, s_2 \}\), where c represents the camera ID, \(\{ b_1, b_2, b_3, b_4\}\) represents the location of the car, and \(\{s_1, s_2 \}\) represent the type and color of the car.

Tracking and Tracklet. Consider a sequence of images denoted by \(I^1, I^2,..., I^T\). An MTMCT model aims to identify and trace the tracklets \(\tau _1,\tau _2,..,\tau _k\) of all objects within a certain period. Each tracklet, denoted as \(\tau _k = \{\tau _k^1,\tau _k^2,...,\tau _k^T\}\), is a sequence of object locations and classification scores \(\tau _k^t = \{ c,b_1, b_2, b_3, b_4, s_1, s_2 \}\), one for a particular object in each frame.
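To make the notation concrete, the minimal sketch below shows one way the per-frame detection tuple and a tracklet could be represented; the class and field names are illustrative and not taken from the released code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Detection:
    """One detected object in one frame: {c, b1, b2, b3, b4, s1, s2}."""
    cam_id: int          # c: camera ID
    box: List[int]       # (b1, b2, b3, b4): quantized box location tokens
    car_type: int        # s1: vehicle type class (e.g., SUV, sedan, truck)
    car_color: int       # s2: vehicle color class

@dataclass
class Tracklet:
    """A sequence of detections of the same object across frames and cameras."""
    track_id: int
    detections: List[Detection] = field(default_factory=list)

    def append(self, det: Detection) -> None:
        self.detections.append(det)

# Example: a car seen by camera 2 in one frame
det = Detection(cam_id=2, box=[120, 126, 201, 250], car_type=3, car_color=1)
tau = Tracklet(track_id=7)
tau.append(det)
```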

Tracking schema. This study decomposes the tracking problem into per-frame object detection and multi-camera inter-frame object association. Specifically, LMD handles per-frame object detection, and LGMA is responsible for multi-camera inter-frame object association. In per-frame object detection, LMD first finds \(N_t\) candidate objects \(\{ o_1^t, o_2^t,...o_{N_t}^t\}\) as a set of location and classification scores \(\{ c,b_1, b_2, b_3, b_4, s_1, s_2 \}\). Then, in multi-camera inter-frame object association, LGMA links the currently detected objects \(\{ o_1^t, o_2^t,...o_{N_t}^t\}\) to existing tracklets \(\tau _1,\tau _2,..,\tau _k\) and updates their status. Previous studies often established the association by considering pairwise matches between objects in consecutive frames (Bewley et al., 2016; Zhou et al., 2020) or by employing an optimization for global association (Frossard and Urtasun 2018; Brasó and Leal-Taixé 2020). Recently, GTR (Zhou et al., 2022) achieved single-pass joint detection and association in an end-to-end fashion. However, GTR only tracks targets in a single camera (video). Motivated by this, our model carries out end-to-end joint detection and association across several cameras in a synchronized manner. All camera streams are processed together, performing object detection and object association in a single forward pass through the network.
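This schema can be summarized by the simplified sketch below; the module interfaces (`lmd`, `lgma`, `memory_buffer`) and the matching threshold are our own placeholders for the components described in Sects. 3.2 and 3.3, not the exact implementation.

```python
def track_step(frames, cam_ids, lmd, lgma, memory_buffer, match_thresh=0.5):
    """One synchronized tracking step over all streamed cameras at time t.

    frames:  the current video frame from each camera
    cam_ids: camera IDs aligned with `frames`
    """
    # (1) Per-frame object detection: LMD returns, for each frame, the detected
    #     objects {c, b1..b4, s1, s2} together with their object embeddings.
    detections, obj_embeddings = lmd(frames, cam_ids)

    # (2) Multi-camera inter-frame association: LGMA scores the current object
    #     embeddings against the representations of the existing tracklets.
    tracklet_reprs, tracklet_ids = memory_buffer.representations(cam_ids)
    assoc_scores = lgma(obj_embeddings, tracklet_reprs)   # [num_objects, num_tracklets]

    # (3) Assign each detection to its best-matching tracklet or start a new one,
    #     then write the updated status back to the memory buffer.
    for i, det in enumerate(detections):
        best = int(assoc_scores[i].argmax())
        if float(assoc_scores[i][best]) > match_thresh:
            memory_buffer.update(tracklet_ids[best], det, obj_embeddings[i])
        else:
            memory_buffer.new_tracklet(det, obj_embeddings[i])
    return memory_buffer.tracklets
```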

3.2 Language model detection (LMD)

Fig. 2 LMD module architecture

The architecture of LMD is depicted in Fig. 2. LMD functions as a per-frame object detector, generating a collection of detections denoted as \(\{c, b_1, b_2, b_3, b_4, s_1,s_2 \}\). In this collection, the variable c represents the camera ID, \(\{ b_1, b_2, b_3, b_4\}\) corresponds to the location of the car, and \(\{s_1, s_2 \}\) represent the type and color of the car, respectively. Following the concepts presented in pix2seq (Chen et al., 2021), we represent the car location prediction as discrete tokens. Specifically, the coordinates {x, y, w, h} used in standard object detection are normalized and quantized into a number of bins (1,000 in this study) to produce \(\{ b_1, b_2, b_3, b_4\}\). The token-generation methodology is based on the architecture of Deformable DETR (Zhu et al., 2020) with encoder and decoder layers. First, the video frame is passed through a ResNet50 backbone, after which its features are combined with the Positional_ID encoder and the Camera_ID encoder. The aggregated feature map is then fed into the encoder and decoder layers of Deformable DETR (Zhu et al., 2020), generating the object embeddings. The object embeddings are fed into a feed-forward network (FFN) to generate the classification scores \(\{c, b_1, b_2, b_3, b_4, s_1, s_2 \}\).
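As a concrete illustration of this pix2seq-style discretization, the following sketch quantizes a continuous box into 1,000 location bins; the helper names are ours, not the paper's.

```python
import torch

NUM_BINS = 1000  # number of quantization bins used in this study

def box_to_tokens(box_xywh: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
    """Convert a continuous box {x, y, w, h} (in pixels) into four discrete
    location tokens (b1, b2, b3, b4) in [0, NUM_BINS - 1]."""
    x, y, w, h = box_xywh.unbind(-1)
    normalized = torch.stack([x / img_w, y / img_h, w / img_w, h / img_h], dim=-1)
    return (normalized.clamp(0, 1) * (NUM_BINS - 1)).round().long()

def tokens_to_box(tokens: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
    """Inverse mapping: recover approximate pixel coordinates from tokens."""
    scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
    return tokens.float() / (NUM_BINS - 1) * scale

# Example: a 1920x1080 frame with a box at x=384, y=540, w=192, h=108
tokens = box_to_tokens(torch.tensor([384.0, 540.0, 192.0, 108.0]), 1920, 1080)
```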

Loss function. Following the methodology proposed in Deformable DETR (Zhu et al., 2020), our approach incorporates three loss functions: the cross-entropy loss, the bounding box f1 loss, and the bounding box IOU loss. Note that the cross-entropy loss interacts with both the bounding box f1 loss and the bounding box IOU loss, because it includes the cross-entropy of the four location predictions, which are also used to compute the bounding box f1 loss and the bounding box IOU loss. Specifically, the cross-entropy loss is computed as a weighted sum of seven cross-entropy terms. In summary, the loss is calculated as follows:

$$\begin{aligned} loss = \alpha _1 * ce + \alpha _2 * bb\_f1 + \alpha _3 * bb\_IOU \end{aligned}$$
(1)

where the set {\(ce, bb\_f1, bb\_IOU\)} represents three types of loss functions, namely cross-entropy loss, bounding box f1 loss, and bounding box IOU loss. The weights of these losses are denoted as \(\alpha _1, \alpha _2, \alpha _3\).

Next, the variable ce is computed using the following formula:

$$\begin{aligned} ce = \beta _1 * c + \beta _2 * b_1 + \beta _3 * b_2 + \beta _4 * b_3 + \beta _5 * b_4 + \beta _6 * s_1 + \beta _7 * s_2 \end{aligned}$$
(2)

where \(\{c, b_1, b_2, b_3, b_4, s_1, s_2 \}\) are the cross-entropy losses of the predictions of the camera ID \(\{c\}\), the car location \(\{ b_1, b_2, b_3, b_4\}\), and the car's type and color \(\{s_1, s_2 \}\), and {\(\beta _1, \beta _2,\beta _3,\beta _4,\beta _5,\beta _6,\beta _7\)} are their weights.
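A minimal sketch of how Eqs. (1) and (2) could be combined in code follows; the weight values, tensor shapes, and dictionary keys are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def lmd_loss(logits, targets, bb_f1, bb_iou,
             alphas=(1.0, 1.0, 1.0),
             betas=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted LMD loss following Eqs. (1) and (2).

    logits / targets: dicts keyed by the seven outputs
        {'c', 'b1', 'b2', 'b3', 'b4', 's1', 's2'}
    bb_f1, bb_iou: precomputed bounding-box regression and IOU loss terms
    """
    keys = ['c', 'b1', 'b2', 'b3', 'b4', 's1', 's2']
    # Eq. (2): weighted sum of the seven cross-entropy terms
    ce = sum(beta * F.cross_entropy(logits[k], targets[k])
             for beta, k in zip(betas, keys))
    # Eq. (1): total loss
    a1, a2, a3 = alphas
    return a1 * ce + a2 * bb_f1 + a3 * bb_iou
```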

Camera ID Encoder (CamEnc). Unlike previous approaches that treat camera IDs only as learnable parameters or fixed one-hot vectors, our framework encodes camera features through the graph structure of adjacent cameras in a local area or along a route. Specifically, we build small subgraphs for each group of neighboring cameras; these are part of a global graph containing all neighbor relationships of the cameras in the datasets. We use state-of-the-art graph neural networks such as GCN (Welling & Kipf, 2017), GIN (Xu et al., 2019), and GAT (Veličković et al., 2018) to extract node embeddings from the constructed global graph. These embeddings are used as camera features before being aggregated with the positional encoding of the position ID and the ResNet50 encoding of each frame t in the LMD component. This approach allows camera embeddings to capture not only the camera ID information used in previous methods but also the relational structure between cameras in geographical space; the graph structure is visualized in Fig. 3. A comprehensive analysis in the Appendix examines the performance of CamEnc on complete graphs (Scenarios 1, 2, 3, and 4-5) and on a path graph (Scenario 6).
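The sketch below illustrates one way such a camera graph could be built and encoded with a GCN. It uses PyTorch Geometric, which the paper does not explicitly name, so the library choice, the example graph, and the dimensions are assumptions.

```python
import torch
from torch_geometric.nn import GCNConv

# Undirected adjacency between neighboring cameras; the IDs and path-graph
# layout below are illustrative only (e.g., one route-like scenario).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
edge_index = torch.tensor(edges + [(j, i) for i, j in edges]).t().contiguous()

num_cameras, feat_dim, emb_dim = 6, 32, 256
x = torch.randn(num_cameras, feat_dim)          # initial camera node features

class CamEnc(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(feat_dim, emb_dim)
        self.conv2 = GCNConv(emb_dim, emb_dim)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)          # one embedding per camera

cam_embeddings = CamEnc()(x, edge_index)          # [num_cameras, emb_dim]
# cam_embeddings[c] is aggregated with the positional encoding and the
# ResNet50 features of each frame from camera c before entering LMD.
```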

Fig. 3 Camera graph based on the geographical locations of the different scenarios. The red arrows indicate the location and direction of the cameras.

3.3 Language and graph model association (LGMA)

Fig. 4 LGMA module architecture. The inputs, outputs, and components of LGMA are marked with distinct colors in the figure; hence, the FFN, T2E, and Filter modules do not belong to LGMA (Color figure online)

The architecture of LGMA is shown in Fig. 4. LGMA performs the multi-camera inter-frame object association task in an end-to-end fashion. The module takes as input the object embeddings from LMD and the existing tracklets \(\tau _1,\tau _2,..,\tau _k\) from the Memory Buffer. It then links these embedded objects to related tracklets via Graph-Based Token Features to determine their updated status in the current frames. Finally, it updates the status of all existing tracklets \(\tau _1,\tau _2,..,\tau _k\) in the Memory Buffer.

Graph-Based Token Feature Construction. To leverage the association information in the token embeddings generated by LMD, we create a graph based on token features, with nodes representing feature vectors (token embeddings) and edges weighted by their Euclidean distance; the graph generation process is presented in Fig. 5. We keep only the edges whose weight exceeds the threshold value \(\tau\). The distance threshold \(\tau = 0.5\) was determined by empirical tuning proposed in previous approaches (Fisichella, 2022; Nguyen et al., 2023). As in LMD, graph neural networks such as GCN (Welling & Kipf, 2017), GIN (Xu et al., 2019), and GAT (Veličković et al., 2018) are used in the LGMA architecture to extract node embedding features from these token feature graphs. These node embeddings are then aggregated with the object embeddings generated by LMD. This combination ensures that the final embeddings not only contain the token information representing tracklet features but also capture the correlation between tokens in the vector space. Furthermore, the GNN serves to refine the learned T2E weights: because T2E is trained independently, its weights remain fixed throughout the training of LaMMOn, so a GNN-based fine-tuning is needed to increase the module's adaptability. Besides, since the generated graphs remain small, with fewer than 50 nodes and edges, the computational time to obtain node embeddings is almost negligible. Thus, this approach improves the overall performance of the proposed architecture (see Sect. 4) while preserving the model's real-time operability for online scenarios.
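A minimal sketch of this token-feature graph construction, using pairwise Euclidean distances and the threshold \(\tau = 0.5\), is shown below; the function name and tensor layout are illustrative assumptions.

```python
import torch

def build_token_graph(token_embs: torch.Tensor, tau: float = 0.5):
    """Build a graph whose nodes are the token embeddings of one tracklet.

    token_embs: [num_tokens, dim] token feature vectors produced by LMD
    Returns (edge_index, edge_weight); edges are kept only if their
    Euclidean distance exceeds the threshold tau, as described above.
    """
    dist = torch.cdist(token_embs, token_embs)     # pairwise Euclidean distances
    mask = (dist > tau) & ~torch.eye(len(token_embs), dtype=torch.bool)
    src, dst = mask.nonzero(as_tuple=True)
    edge_index = torch.stack([src, dst])           # [2, num_edges]
    edge_weight = dist[src, dst]                   # distance used as edge feature
    return edge_index, edge_weight

# Example: 7 tokens of one bounding box (Cam2, loc120, loc126, loc201, loc250, Pickup, Blue)
edge_index, edge_weight = build_token_graph(torch.randn(7, 256))
```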

Fig. 5 Graph-based token feature visualization: a a tracklet consisting of three bounding boxes collected from various video frames; b tokens generated by LMD from the input tracklet; c LGMA-constructed graphs in which nodes are token features (e.g., the tokens of the first bounding box are Cam2, loc120, loc126, loc201, loc250, Pickup, and Blue) and edges represent the Euclidean distance between nodes, with darker edges indicating greater distance (Color figure online)

Global Association. Adopting the design proposed by GTR (Zhou et al., 2022), LGMA associates multiple frames simultaneously. The object embeddings are the input to the encoder, while the existing tracklets are the input to the decoder. More precisely, the object embeddings pass through self-attention layers and then through two linear/ReLU layers before entering the cross-attention layers as keys and values (K, V).

On the other hand, the existing tracklets are used as the query (Q) input to the cross-attention layers. Finally, matrix multiplication computes the association scores between the object embeddings and the existing tracklets. The encoder-decoder design resembles GTR (Zhou et al., 2022); however, based on our ablation study in Sect. 4, both the encoder and decoder consist of 2 layers (GTR uses 1 layer for both). The other parameters remain consistent with the original GTR model (Zhou et al., 2022). The process is depicted in Fig. 4.
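The sketch below captures the core of this association step with standard PyTorch attention layers; the layer sizes and scoring convention are assumptions based on the description above rather than the released implementation.

```python
import torch
import torch.nn as nn

class GlobalAssociation(nn.Module):
    """Cross-attention between current object embeddings and existing tracklets."""
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)])
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.ReLU())
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)])

    def forward(self, obj_emb, tracklet_repr):
        # obj_emb:       [1, num_objects, dim]   (encoder input: keys/values)
        # tracklet_repr: [1, num_tracklets, dim] (decoder input: queries)
        kv = obj_emb
        for attn in self.self_attn:
            kv, _ = attn(kv, kv, kv)
        kv = self.proj(kv)
        q = tracklet_repr
        for attn in self.cross_attn:
            q, _ = attn(q, kv, kv)
        # Association scores between every tracklet and every detected object.
        return q @ kv.transpose(1, 2)               # [1, num_tracklets, num_objects]

scores = GlobalAssociation()(torch.randn(1, 12, 256), torch.randn(1, 5, 256))
```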

Memory Buffer and Filter Module. The memory buffer stores the object embeddings of all existing tracklets. The stored embeddings are subsequently used to construct the tracklet representation, which serves as the query input to the decoder in the global association. Various methods can be employed to build a representation for a tracklet, including the most recent embedding, the mean, the logarithmic mean, or attention-based approaches (Cai et al., 2022); an ablation study is conducted in Sect. 4. To keep the model straightforward and efficient, we use the average of the five most recent embeddings as the tracklet representation. However, using the representations of all existing tracklets would be wasteful and inefficient. The filter module therefore receives the camera ID c of the current inputs and selects only the representations of tracklets in adjacent cameras as the query input for the decoder in the global association process.
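A minimal sketch of the memory buffer and filter behavior, assuming a hypothetical camera adjacency map, could look as follows.

```python
import torch
from collections import defaultdict, deque

class MemoryBuffer:
    """Stores recent object embeddings per tracklet and builds their representations."""
    def __init__(self, window: int = 5):
        self.window = window
        self.embeddings = defaultdict(lambda: deque(maxlen=window))
        self.cameras = defaultdict(set)              # cameras each tracklet was seen in

    def update(self, track_id: int, cam_id: int, emb: torch.Tensor) -> None:
        self.embeddings[track_id].append(emb)
        self.cameras[track_id].add(cam_id)

    def representation(self, track_id: int) -> torch.Tensor:
        # Average of the (at most) five most recent embeddings of this tracklet.
        return torch.stack(list(self.embeddings[track_id])).mean(dim=0)

def filter_tracklets(buffer: MemoryBuffer, cam_id: int, adjacency: dict):
    """Keep only tracklets seen in the current camera or its adjacent cameras."""
    allowed = {cam_id} | set(adjacency.get(cam_id, []))
    return {tid: buffer.representation(tid)
            for tid, cams in buffer.cameras.items() if cams & allowed}
```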

3.4 Text to embedding (T2E)

T2E is specifically developed to address the problem of data limitation and the high cost of manual labeling. The primary idea is that, instead of synthesizing video, we train an encoder (T2E) to generate synthesized object embeddings from defined text tokens representing the object features \(\{ c,b_1, b_2, b_3, b_4, s_1, s_2 \}\). Specifically, the SentencePiece encoder (Kudo & Richardson, 2018) is employed as the main architecture, inspired by the Unified-IO approach (Lu et al., 2022). T2E is trained once, independently of LaMMOn. The input is text identifying the object features \(\{ c,b_1, b_2, b_3, b_4, s_1, s_2 \}\), and the training target/output is the object embedding obtained from LMD. Subsequently, the parameters of T2E remain unchanged when it is used to generate synthesized object embeddings. The synthetic object embeddings serve the same function as the real object embeddings obtained from LMD on real data (videos). More precisely, these synthetic representations are then used to train the LGMA module (with frozen LMD parameters). The efficacy of T2E is showcased in Sect. 4, where we must assess LaMMOn's performance on the test set (Scenario 6) of the CityFlow dataset although no labels are given for this test set. We therefore employ the T2E module to generate object embeddings for Scenario 6 and use them to fine-tune the LGMA module. We conduct an additional experiment in the Appendix to demonstrate the efficacy of T2E in handling novel scenarios derived from other datasets. In addition, we provide a systematic explanation of how to use the T2E model in a novel scenario in the Appendix, together with an extensive evaluation of the similarity between real and synthetic tokens.
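The sketch below outlines how such a text-to-embedding encoder could be trained against LMD embeddings; the tokenizer, architecture, and loss shown here are simplified assumptions rather than the exact T2E implementation.

```python
import torch
import torch.nn as nn

class T2E(nn.Module):
    """Maps text tokens describing {c, b1..b4, s1, s2} to a synthetic object embedding."""
    def __init__(self, vocab_size: int = 32000, dim: int = 256, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))      # [batch, seq_len, dim]
        return h.mean(dim=1)                          # pooled synthetic object embedding

# Training target: real object embeddings produced by the (frozen) LMD.
t2e = T2E()
optim = torch.optim.Adam(t2e.parameters(), lr=1e-4)
token_ids = torch.randint(0, 32000, (8, 7))           # e.g., "Cam2 loc120 ... Pickup Blue"
lmd_embeddings = torch.randn(8, 256)                   # embeddings from LMD on real video
loss = nn.functional.mse_loss(t2e(token_ids), lmd_embeddings)
loss.backward()
optim.step()
```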

4 Experiment

We release the datasets and our LaMMOn model at https://github.com/elituan/lammon.

4.1 Dataset, implementation details and evaluation metric

We assess our technique by conducting experiments on three MTMCT tracking datasets: CityFlow (Tang et al., 2019), I24 (Gloudemans et al., 2024), and TrackCUIP (Harris et al., 2019).

The CityFlow dataset covers different types of streets, including intersections, highways, and road extensions. It comprises 3.25 h of traffic videos captured from 40 cameras at 10 intersections. The CityFlow test set consists of 20 min of street video from six cameras situated at six intersections. We extract each car's type and color for CityFlow using labels from the CityFlow-NL dataset (Feng et al., 2021).

The I24 dataset comprises 234 h of video recordings captured simultaneously from 234 overlapping HD cameras along a 4.2-mile section of an 8-10 lane interstate highway close to Nashville, TN, US. We utilize the ImageNet pre-trained Res2Net (Gao et al., 2019) model to extract the vehicle color for the I24 dataset.

TrackCUIP is a private dataset collected under the TestBed CUIP environment (Harris et al., 2019). It comprises one hour of traffic video captured by four cameras positioned at four different intersections. We use 30 min of video for training, 10 min for validation, and 20 min as the test set.

We train our models for a total of 65 epochs. The details of the parameters are discussed in Sect. 4. The training experiments use four Nvidia Tesla V100s with 32GB of memory each, whereas the inference experiments use one Nvidia Tesla V100 with 32GB of memory.

To assess our model’s performance, we employ the IDF1 (Ristani et al., 2016) and HOTA (Higher Order Tracking Accuracy) (Luiten et al. 2021) metrics. The detailed formula and explanation of these two metrics are provided in the Appendix.
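For orientation, the standard definitions from the cited papers can be summarized as

$$\begin{aligned} IDF1 = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN}, \qquad HOTA_{\alpha } = \sqrt{\frac{\sum _{c \in TP} \mathcal {A}(c)}{|TP| + |FN| + |FP|}}, \qquad HOTA = \int _0^1 HOTA_{\alpha }\, d\alpha \end{aligned}$$

where \(\mathcal {A}(c)\) is the association accuracy of true positive c and \(\alpha\) is the localization threshold; in practice the integral is approximated by averaging \(HOTA_{\alpha }\) over a discrete set of thresholds.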

4.2 Baselines

Due to the limited availability of public approaches for online MTMCT, we choose three baseline methods, namely TADAM (Guo et al., 2021), BLSTM-MTP (Kim et al., 2021), and GraphBased Tracklet (Nguyen et al., 2023), to showcase the effectiveness of our LaMMOn model. TADAM combines position prediction and embedding association synergistically: its attentional modules focus more on targets and less on distractors, and its identity-aware embeddings are strengthened by aggregating memories. BLSTM-MTP aims to consider all tracks during memory updating while minimizing spatial overhead, which it achieves with a unique multi-track pooling module. GraphBased Tracklet represents each tracklet feature as a graph structure and employs graph similarity scores to match tracklets captured by multiple cameras.

In addition, for the I24 dataset, we reuse the baseline results from the original dataset paper (Gloudemans et al., 2024), including SORT, IOU, KIOU, ByteTrack (L2), and ByteTrack (IOU).

4.3 Experimental results

In this section, we assess our methodology using three datasets: CityFlow (Tang et al. 2019), I24 (Gloudemans et al. 2024), and TrackCUIP (Harris et al. 2019), which were described in Sect. 4.1.

4.3.1 CityFlow dataset

We conduct multiple ablation studies on the CityFlow dataset to optimize the parameters of LaMMOn. Due to the page limit, we only present the ablation study on the parameters for increasing FPS and on the T2E module; more details are given in the Appendix. The best IDF1 after tuning the parameters of LMD and LGMA is 77.32%. However, the FPS is only 6.3, which is insufficient for an online MTMCT application. To increase the FPS, we conducted an ablation study with three parameters:

  • num_lay_LMD: The number of layers used in the encoder and decoder of the LMD module. The current value is 6.

  • GNN_dim: The dimension of all GNN layers, with a current value of 256.

  • hid_dim: The hidden dimension of LaMMOn, with a current value of 256.

We carefully analyze the trade-off between inference speed (frames per second, FPS) and IDF1, then choose the most favorable settings. The outcome is shown in Table 1: we achieve an FPS of 12.2 with an IDF1 of 74.12%, trading a 3.2% decrease in IDF1 for a roughly 6 FPS increase. Furthermore, we assess the efficacy of the T2E module using the T2E_min parameter, which denotes the duration of video that would produce the same number of object embeddings as the synthesized embeddings. We assume 15 cars per minute, with each car appearing on camera for 45 s. These synthetic object embeddings are then used to fine-tune LGMA for 15 epochs. Table 1 shows a substantial increase in IDF1 as the length of the synthetic video grows; specifically, IDF1 improves by 4.7% when 16 min of synthetic video are used.

Table 1 Tuning parameters of increasing FPS and the T2E module on the CityFlow Dataset

Finally, the results of LaMMOn are displayed in Table 2, alongside other online MTMCT baselines for comparison. LaMMOn surpasses the existing methods and achieves the best results, with an IDF1 of \(78.83\%\), a HOTA of 76.46%, and an FPS of 12.2. Our model can also be configured for higher accuracy to compete with existing offline-only MTMCT models (Shim et al., 2021; Hou et al., 2019; Specker et al., 2021). Nevertheless, a trade-off exists between the IDF1 score and inference speed (FPS), meaning that the most accurate models may be too slow for an online MTMCT application. In addition, a visualization of tracking results on the CityFlow dataset is shown in Fig. 6.

Table 2 Tracking result of LaMMOn and other online MTMCT methods on CityFlow dataset
Fig. 6 Visualization of tracking results on the CityFlow dataset

4.3.2 I24 Dataset

We train on the I24 dataset with the same parameters used for the CityFlow dataset. The baseline results in Table 3 are taken from the original I24 dataset paper (Gloudemans et al., 2024), except for the TADAM (Guo et al., 2021), BLSTM-MTP (Kim et al., 2021), and GraphBased Tracklet (Nguyen et al., 2023) models. In this experiment, we also use the same train/validation/test sets as described in the original papers (Gloudemans et al., 2024, 2023) to ensure the consistency of the reported results. Table 3 shows that our model outperforms the baselines from the I24 dataset in both HOTA and Recall. Moreover, those baseline models use a given detection component, whereas our approach is an end-to-end model that combines detection and association in one framework. With increases of \(5.5\%\) in HOTA and \(2.5\%\) in Recall, the end-to-end architecture of our proposed model shows more potential than the state-of-the-art approaches for real-time MTMCT tasks. It is worth noting that the I24 dataset is extremely large, consisting of 234 h of video, which is 72 times larger than the CityFlow dataset. Due to its size, the ground-truth labels have not been assigned entirely by hand: a portion of the data is labeled manually, and GPS data from 270 vehicles is used to establish a matching rule against the manually labeled data. The ground-truth labels are then generated with this matching rule and corrected manually. As stated in the original paper, the maximum theoretical performance is HOTA 53.1%.

Table 3 Tracking result of LaMMOn and other methods on I24 Dataset

4.3.3 TrackCUIP dataset

For the TrackCUIP dataset, we conduct the training using the same set of parameters as for the CityFlow dataset. Table 4 presents the performance of LaMMOn and the other baselines on TrackCUIP. The results show that LaMMOn outperforms all baselines, with increases of 4.42% in IDF1 and 2.82% in HOTA, while keeping the FPS at a rate acceptable for an online algorithm.

Table 4 Tracking result of LaMMOn and other methods on TrackCUIP dataset

5 Conclusion and future work

We present an innovative solution for MTMCT applications: an end-to-end multi-camera tracking model based on transformers and graph neural networks, called LaMMOn. Our model overcomes the limitations of the tracking-by-detection paradigm by introducing a generative approach that adapts to new traffic videos while reducing the need for manual labeling. Synthesizing object embeddings from text descriptions, as demonstrated by our Language Model Detection (LMD) and Text-to-embedding (T2E) modules, significantly reduces the data-labeling effort and improves the model's applicability in different scenarios. In addition, our trajectory clustering method built around the Language and Graph Model Association (LGMA) module demonstrates the efficiency of using synthetic embeddings for tracklet generation. This approach overcomes the data limitations of multi-camera tracking and ensures adaptability to different traffic scenarios. Finally, LaMMOn demonstrates real-time online capability and achieves competitive performance on several datasets, such as CityFlow (HOTA 76.46%, IDF1 78.83%), I24 (HOTA 25.7%), and TrackCUIP (HOTA 80.94%, IDF1 81.83%).

In the future, we aim to further improve the robustness of the model by exploring additional language-based graph features and extending its applicability to more datasets. One possible direction is to delve deeper into optimizing the construction of graph structures for extracting camera ID encodings. The success of our model is a significant step towards overcoming the challenges of real-world MTMCT applications and promises improved efficiency and scalability in intelligent transportation systems.