1 Introduction

Fig. 1

An overview of the proposed framework. The field of view (FOV) of each camera is shown as a triangle, and the tracklets and tracks of each camera are shown within it. a First, tracks are determined within each camera; then b tracks of the same person from different non-overlapping cameras are associated, solving the across-camera tracking. Nodes in a represent tracklets and nodes in b represent tracks. \(T^i_j\) is a set of tracklets that form a cluster, resulting in the ith track of camera j. Each cluster in b, shown in color, represents the tracks of the same person in non-overlapping cameras; clusters of the same color represent the same person (best viewed in color) (Color figure online)

As the need for visual surveillance grows, large numbers of cameras are being deployed to cover wide areas such as airports, shopping malls and city blocks. Since the field of view of a single camera is limited, most wide-area surveillance scenarios require multiple cameras to cover larger areas. Using multiple cameras with overlapping fields of view is costly from both economic and computational perspectives. Therefore, camera networks with non-overlapping fields of view are preferred and widely adopted in real-world applications.

In this work, our goal is to track multiple targets and maintain their identities as they move from one camera to another with non-overlapping fields of view. In this context, two problems need to be solved: within-camera data association (or tracking), and across-camera data association using the tracks obtained from within-camera tracking. Although there has been significant progress on each problem separately, tracking multiple targets jointly within and across non-overlapping cameras remains a poorly explored topic.

In this paper, we first determine tracks within each camera (Fig. 1a) by solving data association, and later we associate tracks of the same target across different cameras in a unified approach (Fig. 1b), hence solving the across-camera tracking. Since the appearance and motion cues of a target tend to be consistent within a short temporal window in single-camera tracking, it is common to solve the tracking problem hierarchically: tracklets are generated within short temporal windows first, and are later linked or merged to form full tracks (or trajectories).

We cast the tracking problem as finding clusters of nodes in a graph. Though graph-based approaches are effective at solving tracking problems, their scalability is limited by the size of the graph. In this work, we propose a novel fast-constrained dominant set clustering (FCDSC) technique, a parametrized version of standard quadratic optimization, to solve both within- and across-camera tracking. Typical graph-based methods use the whole graph to solve the clustering problem. In our approach, instead of employing the whole graph, we consider a sub-graph that is much smaller than the original graph. This allows our approach to handle arbitrarily-large graphs while being orders of magnitude faster.

Given a constraint set and a graph, FCDSC generates a compact cluster; that is, it selects a subset of nodes from the given graph which forms a compact and coherent cluster. Since the nodes of the graphs in the first, second and third layers represent, respectively, short-tracklets, tracklets and tracks, the corresponding clusters solve the data association among them, defining tracklets, tracks and across-camera correspondences. The proposed within-camera tracker robustly handles long-term occlusions, does not change the scale of the original problem (as it does not remove nodes from the graph during the extraction of compact clusters), and is several orders of magnitude faster (close to real time) than existing methods.

The proposed across-camera tracking method has several other advantages. More specifically, FCDSC not only considers the affinity (relationship) between tracks observed in different cameras, but also takes into account the affinity among tracks from the same camera. As a consequence, the proposed approach not only accurately associates tracks from different cameras, but also makes it possible to link multiple short broken tracks obtained during within-camera tracking, which may belong to a single target. For instance, in Fig. 1a, tracks \(T_1^3\) (third track of camera 1) and \(T_1^4\) (fourth track of camera 1) belong to the same person but were mistakenly broken off from a single track. However, in the third layer, since they are highly similar to tracks in camera 2 (\(T_2^3\)) and camera 3 (\(T_3^3\)), they form a cluster, as shown in Fig. 1b. This across-camera formulation is thus able to associate these broken tracks with the rest of the tracks from different cameras, represented by the green cluster in Fig. 1b.

The contributions of this paper are summarized as follows:

  • We propose a novel fast-constrained dominant set clustering approach, which is highly scalable and 2000 \(\times \) faster than the previous constrained dominant set clustering (CDSC) method.

  • We formulate multi-target tracking in multiple non-overlapping cameras as finding compact clusters in a graph, simultaneously solving within- and across-camera tracking.

  • We propose a refinement step, a principled way to decide between ambiguous tracks, which improves the overall results.

  • Experiments are performed on the MOTchallenge DukeMTMC and MARS datasets, showing the improved effectiveness of our method with respect to the state of the art.

The rest of the paper is organized as follows. In Sect. 2, we review relevant previous work. Following that, a brief background on constrained dominant set clustering (CDSC) is presented in Sect. 3. The proposed approach for the within- and across-camera tracking modules is summarized in Sect. 4. In particular, in Sect. 4.1, we present the proposed fast-constrained dominant set clustering method, while Sects. 5 and 6 provide details of within- and across-camera tracking. Next, Sect. 7 discusses our track refinement step. Experimental results are presented in Sect. 8. Finally, Sect. 9 concludes the paper.

2 Related Work

Visual tracking is a very active area of research with a rich literature, and it is impossible to cover all approaches here. For the sake of brevity, we only cover the works most relevant to this paper, and refer readers to excellent surveys on tracking (Li et al. 2013; Smeulders et al. 2013; Yilmaz et al. 2006). We divide our review into the three following parts.

Single Camera Tracking: Single-camera tracking associates target detections across frames in a video sequence in order to generate the target's motion trajectory over time. Zamir et al. (2012) formulate the tracking problem as a generalized maximum clique problem (GMCP), where the relationships among all detections within a temporal window are considered. A cost is assigned to each clique (a complete sub-graph with one node selected from each frame), and the clique maximizing their score function is selected, yielding a compact cluster corresponding to a track. Nonetheless, the approach is prone to local optima, as it uses a greedy local neighbourhood search. Dehghan et al. (2015) cast tracking as a generalized maximum multi clique problem (GMMCP) and jointly optimize all tracks simultaneously. In order to handle outliers and weak-detection associations, they introduce dummy nodes. However, this solution is computationally expensive, and the hard constraints in their optimization make the approach impractical for large graphs. Leibe et al. (2007) simultaneously optimize detection and tracking by coupling them into a Quadratic Boolean Problem. Brendel et al. (2011) apply a maximum-weight independent set algorithm to hierarchically merge small tracklets into long tracks. Wang et al. (2014) formalize tracking as a min-cost network flow problem, first grouping detections into tracklets and then combining those into tracks. Hamid Rezatofighi et al. (2015) revisit the Joint Probabilistic Data Association (JPDA) formulation and address its complexity by leveraging recent developments in finding the m-best solutions of an integer linear program. Yu et al. (2016) propose an identity-aware multi-object tracker based on the solution path algorithm: tracking is formulated as a quadratic optimization problem with \(l_0\) norm constraints, which is solved with the solution path algorithm. Maksai et al. (2016) formulate the ball tracking problem in sport events, together with the related physical constraints, as mixed integer programming. However, most of the above approaches become computationally expensive as the number of targets increases, while our approach can handle arbitrarily-large graphs, as it operates on selected sub-graphs of the larger graph.

Multi Camera Tracking: Across-camera tracking is a challenging problem due to illumination and pose changes across cameras, and track discontinuities due to blind areas or missed detections. Existing across-camera tracking methods try to deal with these problems using appearance cues. Appearance variation due to illumination changes has been dealt with using different techniques, such as brightness transfer functions (BTFs) (Javed et al. 2008). Gilbert and Bowden (2006) used an incremental learning method to model color variations, and Prosser et al. (2008) proposed a Cumulative BTF, which makes better use of the available color information from a very sparse training set. A performance comparison of different variations of BTFs can be found in D'Orazio et al. (2009). Srivastava et al. (2011) tried to achieve color consistency using colorimetric principles, where the image analysis system is modelled as an observer and camera-specific transformations are determined so that images of the same target appear similar to this observer.

Obviously, learning BTFs or color correction models requires a large amount of training data, and such models may not be robust against drastic illumination changes across cameras. Therefore, recent approaches have combined them with spatio–temporal cues, which improves multi-target tracking performance (Cai and Medioni 2014; Chen et al. 2014; Cheng et al. 2017; Gao et al. 2014; Kuo et al. 2010; Zhang et al. 2015). Cheng et al. (2017) utilized the human part configurations of each target track from different cameras to describe across-camera spatio–temporal constraints for across-camera track association, which is formulated as a multi-class classification problem via Markov random fields (MRFs). In Cai and Medioni (2014), spatio–temporal context is used for collecting samples for discriminative appearance learning. Chen et al. (2014) learn across-camera transfer models that combine both spatio–temporal and appearance cues.

However, using only low-level information (appearance and spatio–temporal cues) is unreliable for tracking across non-overlapping cameras. Therefore, in this work, to associate tracks across different cameras, besides using their pairwise similarity computed from appearance and spatio–temporal cues, we employ their relative similarity, enforcing that their corresponding FCDSC clusters should also be similar. Also, most of the approaches mentioned above assume that within-camera tracking results are given for all cameras. Conversely, the work proposed in this paper addresses a more realistic problem by solving both within- and across-camera tracking in one joint framework.

Recently, the problem of target tracking across multiple non-overlapping cameras has also been tackled in Ristani et al. (2016), by extending their previous single-camera tracking method (Ristani and Tomasi 2014), where the tracking task is formulated as a graph partitioning problem. However, their approach becomes impractical as the graph grows, whereas we propose a highly-scalable approach that can handle arbitrarily-large graphs. In Maksai et al. (2017), the authors impose global consistency of trajectories by using behavioral patterns to guide the tracking algorithm, and show that when their approach is used together with existing state-of-the-art tracking algorithms it further improves their performance. However, this approach assumes that the initial trajectories are given, which is a significant limitation. In Ristani and Tomasi (2018), the authors showed that learning high-quality appearance features leads to good clustering solutions, and proposed an adaptive weighted triplet loss to learn better feature embeddings. Yoon et al. (2018) used track-hypothesis trees to solve tracking in multiple cameras, performing within-camera tracking simultaneously with the tree formation by manipulating the status of each track hypothesis. However, their approach struggles in crowded scenes.

Person Re-Identification: Another popular recent research topic, video-based person re-identification (ReID) (Cong et al. 2009; Farenzena et al. 2010; Liao et al. 2015; Ma et al. 2014; McLaughlin et al. 2016; Wang et al. 2014; Xiong et al. 2014; You et al. 2016; Zheng et al. 2015), is closely related to across-camera multi-target tracking. Both problems aim to match tracks of the same person across non-overlapping cameras; however, across-camera tracking requires a one-to-one correspondence between tracks of different cameras. Moreover, person ReID approaches mainly focus on building a strong appearance model to match the same person observed in different views, or on learning a distance metric (Köstinger et al. 2012; Liao et al. 2015) that maximizes the differences between different people. To this end, ReID approaches have made impressive advances. However, most video-based ReID approaches exploit only the pairwise affinities between the probe and the gallery to obtain the final ranking. By employing FCDSC in conjunction with state-of-the-art video-based ReID algorithms, and by formulating the problem as a constrained quadratic optimization problem, we show that the performance of ReID methods can be further increased.

3 Background on Constrained Dominant Set Clustering

In this section, we briefly introduce the basic definitions and formulation of the constrained dominant set clustering framework.

As introduced in Zemene and Pelillo (2016), constrained dominant set clustering, a constrained quadratic optimization program, is an efficient approach originally applied to interactive image segmentation. It generalizes the dominant set clustering framework (Pavan and Pelillo 2007), which is itself a well-known generalization of the maximal clique problem to edge-weighted graphs. We are given an edge-weighted graph \(G(V,E,w)\) and a constraint set \(\mathcal {Q} \subseteq V\), where V, E and w denote, respectively, the set of nodes (of cardinality n), the edges and the edge weights. The objective is to find sub-graphs that contain all or some of the elements of the constraint set and form a compact cluster.

Consider a graph, G, with n vertices (set V), and its weighted adjacency matrix \(\mathtt {A}\). Given a parameter \( \alpha > 0\), let us define the following parametrized quadratic program:

$$\begin{aligned} \begin{array}{ll} \text {maximize } &\quad f_{\mathcal {Q}}^\alpha (\mathbf{x}) = \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}\\ \text {subject to} &\quad \mathbf {x} \in \varDelta \end{array} \end{aligned}$$
(1)

where \(\varDelta =\{ \mathbf{x}\in \mathbb {R}^n : \sum _i x_i = 1, \text { and } x_i \ge 0 \text { for all } i=1 \ldots n\}\) is the standard simplex. From now on we will refer to \(\mathbf {x}\) as a state, a distribution (since its elements sum to one) or a membership score vector, interchangeably. \(I_{\mathcal {Q}}\) is the \(n \times n\) diagonal matrix whose diagonal elements are set to 1 in correspondence to the vertices contained in \( V {\setminus } \mathcal {Q}\) (the set V without the elements in \(\mathcal {Q}\)) and to zero otherwise.

Let \(\mathcal {Q} \subseteq V\), with \(\mathcal {Q} \ne \emptyset \), and let \(\alpha > \lambda _{\max }( {\mathtt {A}}_{ V {\setminus } \mathcal {Q}}) \), where \(\lambda _{\max }( {\mathtt {A}}_{ V {\setminus } \mathcal {Q}})\) is the largest eigenvalue of the principal submatrix of \({\mathtt {A}}\) indexed by the elements of \( V {\setminus } \mathcal {Q}\). If \(\mathbf{x}\) is a local maximizer of \(f_{\mathcal {Q}}^\alpha \) in \(\varDelta \), then \(\sigma (\mathbf{x}) \cap \mathcal {Q} \ne \emptyset \), where \( \sigma (\mathbf{x}) = \{i \in V{:}\,x_i > 0\} \) (Zemene and Pelillo 2016).

The above result provides us with a simple technique to determine clusters containing the user-specified query vertices \(\mathcal {Q}\). Indeed, if \(\mathcal {Q}\) contains the vertices selected by the user, by setting

$$\begin{aligned} \alpha > \lambda _{\max }({\mathtt {A}}_{ V {\setminus } \mathcal {Q}}) \end{aligned}$$
(2)

we are guaranteed that all local solutions of (1) will have a support that necessarily contains elements of \(\mathcal {Q}\).

Standard Quadratic Program (StQP) Solvers: The standard quadratic program (1) can be solved using dynamics from evolutionary game theory. In Zemene and Pelillo (2016), replicator dynamics, a well-known family of algorithms from evolutionary game theory inspired by Darwinian selection processes (Smith 1988), is employed to solve the standard quadratic program. Despite their effectiveness in finding good solutions in a variety of applications, replicator dynamics are computationally expensive, as they require a number of operations per step that grows quadratically with the dimensionality of the problem, \(\mathcal {O}(n^2)\), which makes them inefficient for large-scale problems. The efficient out-of-sample extension of dominant sets (Pavan and Pelillo 2004) reduces the computational cost by sampling nodes of the graph at a given sampling rate, whose choice affects the efficiency of the framework. Liu et al. (2013) proposed an iterative clustering algorithm which operates in two steps, shrink and expansion, used to reduce the run time of replicator dynamics; however, it requires the whole graph, which makes it slow. The approach also has several limitations, such as its preference for sparse graphs with several small clusters, and its sensitivity to several additional parameters.

Fig. 2

The figure shows within-camera tracking, where short-tracklets (s) from different segments are used as input to our first layer of tracking. The resulting tracklets (t) from the first layer are inputs to the second layer, which determines the tracks (T) of each person. The three dark green short-tracklets (\(s_1^2, s_1^{10}, s_1^7\)), shown by the dotted ellipse in the first layer, form a cluster resulting in a tracklet (\(t_1^2\)) in the second layer, as shown by the black arrow. In the second layer, the clusters shown in purple, green and dark red form the tracks of different targets; as can be seen in the top row, tracklets and tracks with the same color indicate the same target. The two green clusters (with two and three tracklets) represent the tracks of a person going in and out of the building (tracks \(T^p_1\) and \(T^2_1\), respectively) (Color figure online)

All the above formulations, with their respective limitations, try to minimize the computational complexity of StQP solved with standard game dynamics, whose complexity is \(\mathcal {O}(n^2)\) per iteration. Rota Bulò et al. (2011) proposed a new class of evolutionary game dynamics, called Infection and Immunization Dynamics (InfImDyn), which simulates an infection and immunization process. Both the process that finds a distribution able to infect the population and the process that finds the immune state are linear; therefore, InfImDyn solves each iteration in linear time, \(\mathcal {O}(n)\). However, it needs the whole affinity matrix to extract a cluster.

In our approach, we further reduce the computational time by running InfImDyn on a very small sub-graph selected from the original graph. We propose a principled technique to select a sub-graph which contains all possible solutions. We also show that the solution obtained on the sub-graph is a valid solution on the larger graph, which makes the dynamics even faster, \(\mathcal {O}(r)\) with \(r\ll n\), by significantly reducing the search space. Note that the proposed approach can be used with any of the solvers discussed above to improve their efficiency, as it reduces the search space.

4 Overall Approach

In our formulation, each node of the first-layer graph represents a short-tracklet along a temporal window (typically 15 frames) (Fig. 2a). We apply fast-constrained dominant set clustering to determine clusters in this graph, which correspond to tracklets. Likewise, each node of the second-layer graph represents a tracklet (Fig. 2b) obtained from the first layer, and FCDSC is applied to determine clusters, which correspond to tracks (Fig. 2c). Finally, in the third layer, the nodes of the graph correspond to tracks from different non-overlapping cameras, obtained from the second layer, and FCDSC is applied to determine clusters, which relate tracks of the same person across non-overlapping cameras.

In this section, we first present our proposed fast-constrained dominant set clustering approach, followed by the formulation of within- and across-camera tracking.

4.1 Fast-Constrained Dominant Set Clustering

In this paper, we propose an algorithm that reduces the search space using the Karush–Kuhn–Tucker (KKT) conditions of the constrained quadratic optimization, effectively enforcing the user constraints. In the constrained optimization framework (Zemene and Pelillo 2016), the algorithm computes the eigenvalue of a submatrix for every extraction of a compact cluster containing the user constraint set. Computing eigenvalues for large graphs is computationally intensive, which makes the whole algorithm inefficient. In our approach, instead of operating over the whole graph, we localize the computation on a sub-matrix, selected using the dominant distribution, that is much smaller than the original one. To alleviate the eigenvalue issue, we exploit a standard spectral bound: a good approximation for the parameter \(\alpha \) is the maximum degree of the graph, which is always at least as large as the largest eigenvalue of the corresponding matrix. The computational complexity, apart from eigenvalue computation, is thus reduced to \(\mathcal {O}(r)\), where r is the size of the employed sub-matrix, which is much smaller than the dimension of the original affinity matrix.
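As a quick illustration of this bound (ours, not from the paper): for a nonnegative symmetric affinity matrix with zero diagonal, the maximum row sum (the maximum degree) upper-bounds the spectral radius by Gershgorin's theorem, and eigenvalue interlacing extends the bound to every principal submatrix \(\mathtt {A}_{V{\setminus }\mathcal {Q}}\), so the max degree always satisfies the condition of Eq. (2). A minimal numeric check in Python:

```python
import numpy as np

# Minimal check that the max-degree bound dominates the eigenvalues used in Eq. (2).
# The affinity matrix here is random, purely for illustration.
rng = np.random.default_rng(0)
A = rng.random((200, 200))
A = (A + A.T) / 2                       # symmetric affinities
np.fill_diagonal(A, 0)                  # no self-loops, as in dominant-set graphs

alpha = A.sum(axis=1).max()             # max degree: cheap, no eigendecomposition
lam_full = np.linalg.eigvalsh(A).max()  # lambda_max of the full matrix

keep = rng.choice(200, size=120, replace=False)   # an arbitrary V \ Q
lam_sub = np.linalg.eigvalsh(A[np.ix_(keep, keep)]).max()

assert alpha > lam_full >= lam_sub      # so alpha satisfies the condition of Eq. (2)
```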

Let us summarize the KKT conditions for the quadratic program in Eq. (1). Introducing n non-negative Lagrange multipliers \( \mu _1, \ldots , \mu _n\) and a real multiplier \(\lambda \), the Lagrangian function is defined as follows:

$$\begin{aligned} \mathcal {L}(\mathbf{x},\mu ,\lambda ) = f_{\mathcal {Q}}^\alpha (\mathbf{x}) + \lambda \left( 1 - \sum \limits _{i=1}^{n}x_i \right) + \sum \limits _{i=1}^{n}\mu _ix_i. \end{aligned}$$

For a distribution \(\mathbf{x}\in \varDelta \) to be a KKT point, that is, to satisfy the first-order necessary conditions for local optimality (Solow 2007), it should satisfy the following two conditions: the derivative of the Lagrangian \(\mathcal {L}\) should be zero,

$$\begin{aligned} 2\,[({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}]_i - \lambda + \mu _i = 0, \end{aligned}$$

for all \(i=1 \ldots n\), and

$$\begin{aligned} \sum _{i=1}^n x_i \mu _i = 0. \end{aligned}$$

Since both the \(x_i\) and the \(\mu _i\) are nonnegative, the latter condition is equivalent to saying that \(i \in \sigma (\mathbf{x})\) implies \(\mu _i= 0\), from which we obtain:

$$\begin{aligned}{}[({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}]_i \ \begin{cases} = \lambda /2, &\quad \text{ if }\,\,i \in \sigma (\mathbf{x}) \\ \le \lambda /2, &\quad \text{ if }\,\,i \notin \sigma (\mathbf{x}) \end{cases} \end{aligned}$$
(3)

We now need to define a dominant distribution.

Definition 1

A distribution \(\mathbf{y}\in \varDelta \) is said to be a dominant distribution for \(\mathbf{x}\in \varDelta \) if

$$\begin{aligned} \sum \limits _{i,j=1}^{n} x_iy_ja_{ij} - \alpha \sum \limits _{i=1}^{n} x_iy_i > \sum \limits _{i,j=1}^{n} x_ix_ja_{ij} - \alpha \sum \limits _{i=1}^{n} x_i^2, \end{aligned}$$
(4)

where \(a_{ij}\) is the similarity between nodes i and j.

Let the “support” be \( \sigma (\mathbf{x}) = \{i \in V{:}\,x_i > 0\} \) and \(e_i\) the ith unit vector (a zero vector whose ith element is one).

Proposition 1

Given an affinity matrix \({\mathtt {A}}\) and a distribution \(\mathbf{x}\in \varDelta \), if \(({\mathtt {A}}\mathbf{x})_i > \mathbf{x}^{\top }{\mathtt {A}}\mathbf{x}- \alpha \mathbf{x}^{\top }_{\mathcal {Q}} \mathbf{x}_{\mathcal {Q}} = \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}})\mathbf{x}\) for some \(i \notin \sigma (\mathbf{x})\), then:

  1. \(\mathbf{x}\) is not a maximizer of the parametrized quadratic program (1);

  2. \(e_i\) is a dominant distribution for \(\mathbf{x}\).

Proof

To show that the first claim holds, assume by contradiction that \(\mathbf{x}\) is a KKT point. Expanding the objective,

$$\begin{aligned} \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}= \sum \limits _{i=1}^{n} x_i [({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}]_i \end{aligned}$$

Since \(\mathbf{x}\) is a KKT point

$$\begin{aligned} \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}= \sum \limits _{i=1}^{n} x_i \cdot \lambda /2 = \lambda /2. \end{aligned}$$

From the second case of the KKT condition (3), we have:

$$\begin{aligned}{}[({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}]_i \le \lambda /2 = \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}\end{aligned}$$

Since \(i \notin \sigma (\mathbf{x})\) means \(x_i = 0\), the left-hand side reduces to \(({\mathtt {A}}\mathbf{x})_i\), so

$$\begin{aligned} ({\mathtt {A}}\mathbf{x})_i \le \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}. \end{aligned}$$

This contradicts the hypothesis of the proposition; hence \(\mathbf{x}\) cannot be a KKT point and therefore is not a maximizer, which proves the first claim.

As for the second claim, \(e_i\) is a dominant distribution for \(\mathbf{x}\) if it satisfies the inequality

$$\begin{aligned} \left\{ e_i^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}\right\} > \left\{ \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}\right\} . \end{aligned}$$

Since \(i \notin \sigma (\mathbf{x})\) implies \(x_i = 0\), we have \(e_i^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}= ({\mathtt {A}}\mathbf{x})_i\), so the above inequality is equivalent to:

$$\begin{aligned} ({\mathtt {A}}\mathbf{x})_i > \left\{ \mathbf{x}^{\top }({\mathtt {A}} - \alpha I_{\mathcal {Q}}) \mathbf{x}\right\} . \end{aligned}$$

which holds by the hypothesis of the proposition, concluding the proof. \(\square \)

The proposition provides us with an easy-to-compute dominant distribution.

We summarize the details of our proposed algorithm in Algorithm 1. Given a subset of vertices \(\mathcal {Q} \subseteq V\), the face of \(\varDelta \) corresponding to \(\mathcal {Q}\) is given by \(\varDelta _\mathcal {Q} = \{\mathbf{x}\in \varDelta : \sigma (\mathbf{x}) \subseteq \mathcal {Q}\}\). The function \(\mathcal {G}(\mathtt {A},\mathcal {Q})\) returns a local maximizer of program (1). \(\mathcal {S}(\mathtt {A},\mathbf{x})\) returns a dominant distribution \(\mathbf{x}_d\) for the distribution \(\mathbf{x}\); \(\mathbf{x}_d\) is a distribution whose support contains all indices \(i \in V\) that satisfy the second condition of Proposition 1. Then, \(\mathcal {H}\) is computed as the union of the support of \(\mathbf{x}_d\) with the constraint set \(\mathcal {Q}\), which gives the final sub-graph nodes on which we run the dynamics.

[Algorithm 1: Fast-constrained dominant set clustering]

Each selected dominant distribution increases the value of the objective function; since the objective is bounded, the convergence of the algorithm is guaranteed.
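Since the pseudo-code of Algorithm 1 is not reproduced here, the following is a minimal Python sketch of the procedure under stated assumptions: it uses the max-degree bound for \(\alpha \), grows the sub-graph using the dominant-distribution test of Proposition 1, and runs plain replicator dynamics instead of InfImDyn on the sub-graph. All names are ours, not the authors' implementation.

```python
import numpy as np

def _replicator(B, max_iter=1000, tol=1e-8):
    """Replicator dynamics on a nonnegative payoff matrix B."""
    y = np.full(B.shape[0], 1.0 / B.shape[0])
    for _ in range(max_iter):
        num = y * (B @ y)
        y_next = num / num.sum()
        if np.abs(y_next - y).sum() < tol:
            break
        y = y_next
    return y

def fcdsc(A, Q):
    """Sketch of fast-constrained dominant set clustering.
    A: (n, n) symmetric nonnegative affinity with zero diagonal.
    Q: iterable of constraint-vertex indices.
    Returns x on the simplex; sigma(x) = {i : x[i] > 0} is the extracted
    cluster, and the entries of x are the membership scores."""
    n = A.shape[0]
    Q = sorted(set(Q))
    alpha = A.sum(axis=1).max() + 1e-12   # max degree: bounds lambda_max(A_{V\Q}), Eq. (2)
    d = np.ones(n)
    d[Q] = 0.0                            # diag(I_Q): 1 on V\Q, 0 on Q
    B = A - alpha * np.diag(d)            # payoff matrix of program (1)

    support = set(Q)
    while True:
        sub = sorted(support)
        # Run the dynamics on the small sub-graph only. Adding the constant
        # alpha to all payoffs keeps them nonnegative without moving equilibria.
        y = _replicator(B[np.ix_(sub, sub)] + alpha)
        x = np.zeros(n)
        x[sub] = y
        # Proposition 1: any i outside sigma(x) with (Ax)_i > x'Bx yields a
        # dominant distribution, so the sub-graph must grow to include it.
        newcomers = set(np.flatnonzero((A @ x > x @ B @ x) & (x == 0))) - support
        if not newcomers:
            return x
        support |= newcomers
```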

5 Within-Camera Tracking

Figure 2 shows the proposed within-camera tracking framework. First, we divide a video into multiple short segments, each containing f frames (typically 15), and generate short-tracklets by connecting human detection bounding boxes in consecutive frames that have a large (typically 70%) overlap (Dehghan et al. 2015). Then, short-tracklets from \(\mathcal {N}\) (typically 10) different non-overlapping segments are used as input to the first layer of tracking, where the nodes are short-tracklets (Fig. 2a). The resulting tracklets from the first layer are used as input to the second layer; that is, a tracklet from the first layer is now represented by a node in the second layer (Fig. 2b). In the second layer, tracklets of the same person from different segments are associated, forming the tracks of a person within a camera.

5.1 Formulation Using Fast-Constrained Dominant Sets

We build an input graph, \(G(V,E,w)\), where nodes represent short-tracklets (\(s_i^j\), the jth short-tracklet of camera i) in the first layer (Fig. 2a) and tracklets (\(t_k^l\), the lth tracklet of camera k) in the second layer (Fig. 2b). The corresponding affinity matrix \(\mathtt {A}=\left\{ a_{i,j}\right\} \), where \(a_{i,j}=w(i,j)\), is built. The weight w(i,j) assigned to each edge considers both the motion and appearance similarity between the two nodes. Fine-tuned CNN features are used to model the appearance of a node. These features are extracted from the last fully-connected layer of an ImageNet pre-trained 50-layer Residual Network (ResNet-50) (He et al. 2016), fine-tuned using the Trainval sequence of the DukeMTMC dataset. Similar to Zamir et al. (2012), we employ a global constant velocity model to compute the motion similarity between two nodes.

Determining Clusters: In our formulation, a cluster of graph G represents a tracklet (track) in the first (second) layer. Using short-tracklets/tracklets as the constraint set (in Eq. 1), we enumerate all clusters with the proposed approach, utilizing intrinsic properties of fast-constrained dominant sets. Note that we do not remove the nodes of found clusters from the graph; this keeps the scale of the problem (the number of nodes in the graph) unchanged, which guarantees that all the local solutions found are local solutions of the original graph. After the extraction of each cluster, the constraint set is changed so as to make the extracted cluster less favored by the dynamics; this prevents the algorithm from re-selecting sets of nodes from previous solutions. Within-camera tracking starts with all nodes as the constraint set. Let \(\varGamma ^i\) denote the ith extracted cluster, so \(\varGamma ^1\) is the first extracted cluster, containing a subset of elements of the whole set. After the first cluster extraction, we change the constraint set to \(V{\backslash } \varGamma ^1\), hence rendering the associated nodes unstable. The procedure iterates, updating the constraint set at the ith extraction to \(V{\backslash } \bigcup \nolimits _{l=1}^i \varGamma ^l\), until the constraint set becomes empty. Since we are not removing the nodes of the graph after each cluster extraction, we may end up with a solution that assigns a node to more than one cluster, resulting in overlapping clusters.

To find the final solution, we use the notion of centrality of fast-constrained dominant sets. The true cluster of a node j that has been assigned to \(\mathtt {K} > 1\) clusters, \(\psi = \left\{ \varGamma ^1 \dots \varGamma ^\mathtt {K} \right\} \), is computed as:

$$\begin{aligned} \mathop {{\arg \,\max }}\limits _{\varGamma ^i \in \psi } ~ \left( |\varGamma ^i|*\delta ^i_j\right) , \end{aligned}$$

where the cardinality \(|\varGamma ^i|\) is the number of nodes forming the ith cluster and \(\delta ^i_j\) is the membership score of node j when assigned to cluster \(\varGamma ^i\). The normalization by the cardinality is important to avoid an unnatural bias towards smaller sets.
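For concreteness, a minimal sketch of the cluster enumeration and of this cardinality-weighted assignment, reusing the fcdsc sketch above (names are ours):

```python
import numpy as np

def enumerate_clusters(A):
    """Peel off clusters without removing nodes: only the constraint set shrinks."""
    n = A.shape[0]
    constraint = set(range(n))              # start with all nodes as constraint set
    clusters = []                           # list of (node set, membership vector)
    while constraint:
        x = fcdsc(A, constraint)
        members = set(np.flatnonzero(x > 0))
        clusters.append((members, x))
        constraint -= members               # V \ (Gamma^1 U ... U Gamma^i)
    return clusters

def resolve_overlaps(clusters, n):
    """Assign node j to argmax_i |Gamma^i| * delta^i_j over the clusters containing it."""
    label = -np.ones(n, dtype=int)
    best = np.zeros(n)
    for k, (members, x) in enumerate(clusters):
        for j in members:
            score = len(members) * x[j]     # cardinality-weighted membership score
            if score > best[j]:
                best[j], label[j] = score, k
    return label                            # label[j] = final cluster index of node j
```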

Algorithm 2, with the number of cameras under consideration (\(\mathcal {I}\)) set to 1 and \(\mathcal {Q}\) set to the short-tracklet (tracklet) set in the first (second) layer, is used to determine the clusters which correspond to tracklets (tracks) in the first (second) layer.

6 Across-Camera Tracking

6.1 Graph Representation of Tracks and the Payoff Function

Given the tracks \(T_i^j \) of the different cameras from the previous step, we build the graph \(G'(V',E',w')\), where nodes represent tracks and the corresponding affinity matrix \(\mathtt {A}\) depicts the similarity between tracks.

Assuming we have \(\mathcal {I}\) cameras and \( \mathtt {A}^{i\times j}\) represents the similarity among the tracks of cameras i and j, the final track-based affinity \(\mathtt {A}\) is built as

$$\begin{aligned} \mathtt {A} = \begin{pmatrix} \mathtt {A}^{1 \times 1} & \ldots & \mathtt {A}^{1 \times j} & \ldots & \mathtt {A}^{1 \times \mathcal {I}} \\ \vdots & & \vdots & & \vdots \\ \mathtt {A}^{i \times 1} & \ldots & \mathtt {A}^{i \times j} & \ldots & \mathtt {A}^{i \times \mathcal {I}}\\ \vdots & & \vdots & & \vdots \\ \mathtt {A}^{\mathcal {I} \times 1} & \ldots & \mathtt {A}^{\mathcal {I} \times j} & \ldots & \mathtt {A}^{\mathcal {I} \times \mathcal {I}} \end{pmatrix}. \end{aligned}$$

Figure 3 shows an example graph for across-camera tracking among three cameras. Black and orange edges, respectively, represent within- and across-camera relations between tracks. In the affinity \(\mathtt {A}\), \(\mathtt {A}^{i\times j}\) represents the black edges of camera i when \(i = j\), and otherwise the across-camera relations shown by the orange edges. The colors of the nodes depict track identity: nodes of the same color represent tracks of the same person. Due to several reasons, such as long occlusions, severe pose changes, or reappearance, a person may have more than one (broken) track within a camera. The green nodes of camera 1 (the second and the pth tracks) typify two broken tracks of the same person due to reappearance, as shown in Fig. 2. The proposed unified approach, discussed in the next section, is able to deal with such cases.
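For concreteness, the block structure of \(\mathtt {A}\) can be assembled as follows (an illustrative sketch with hypothetical per-camera track counts; any per-block similarity model can be plugged in):

```python
import numpy as np

# Hypothetical per-camera-pair similarity blocks: blocks[i][j] holds the
# similarities between the tracks of camera i and the tracks of camera j.
rng = np.random.default_rng(0)
n_tracks = [4, 3, 5]                                  # tracks per camera (illustrative)
blocks = [[rng.random((n_tracks[i], n_tracks[j]))
           for j in range(3)] for i in range(3)]
for i in range(3):                                    # enforce a symmetric affinity
    blocks[i][i] = (blocks[i][i] + blocks[i][i].T) / 2
    for j in range(i + 1, 3):
        blocks[j][i] = blocks[i][j].T
A = np.block(blocks)                                  # final 12 x 12 track affinity
```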

Fig. 3

Exemplar graph of tracks from three cameras. \(T^i_j\) represents the ith track of camera j. Black and colored edges, respectively, represent within- and across-camera relations between tracks. The colors of the nodes depict track IDs, with nodes of the same color representing tracks of the same person, and the thick lines show both within- and across-camera associations (Color figure online)

6.2 Across-Camera Track Association

In this section, we discuss how we simultaneously solve within- and across-camera tracking. Our framework naturally deals with the errors listed above. In our exemplar graph (Fig. 3), the targets represented by green and red nodes in cameras 1 and 2, respectively, each have two tracks that are difficult to merge during within-camera tracking; however, since the tracks are highly similar, they belong to a cluster together with track(s) from cameras 2 and 3 in the case of the green target, and from camera 1 in the case of the red target. The algorithm applied to such an across-camera graph is able to cluster all the correct tracks. This allows us to link broken tracks of the same person produced during the within-camera track generation stage.

Using the graph, with the track nodes of one camera as the constraint set, data association both within and across cameras is performed simultaneously. Let us assume, in our exemplar graph (Fig. 3), that our constraint set \(\mathcal {Q}\) contains the nodes corresponding to the tracks of camera 1, \(\mathcal {Q} = \{T^1_1, T^2_1, T^i_1, T^p_1\}\). \(I_\mathcal {Q}\) is then an \(n \times n\) diagonal matrix, where n is the number of tracks in all cameras, whose diagonal elements are set to 1 for the vertices of all cameras except camera 1, where they are zero. That is, the sub-matrix of \(I_\mathcal {Q}\) corresponding to \(\mathtt {A}^{1\times 1}\) is a zero matrix of size equal to the number of tracks of that camera. Setting \(\mathcal {Q}\) in this way guarantees that the maximizer of the program in Eq. (1) contains some elements of \(\mathcal {Q}\): e.g., \(\mathcal {C}_1^1=\left\{ T^2_1, T^p_1, T^q_2, T^2_3\right\} \) forms a cluster which contains the set \(\left\{ T^2_1, T^p_1\right\} \subset \mathcal {Q}\). This is shown in Fig. 3 using the thick green edges (which illustrate across-camera track association) and the thick black edge (which typifies within-camera track association). The second set, \(\mathcal {C}_1^2\), contains the tracks shown in dark red, illustrating the case where within- and across-camera tracks are in one cluster. Lastly, \(\mathcal {C}_1^3 = \{T^1_1\}\) represents a track of a person who appears only in camera 1. In general, \(C^i_j\) denotes the ith track set obtained using the tracks of camera j as the constraint set, and \(C_j\) is the collection of track sets generated using camera j as the constraint set, e.g. \(C_1 = \left\{ C^1_1, C^2_1, C^3_1\right\} \). We iteratively process all the cameras and then apply the track refinement step described in Sect. 7.

Though Algorithm 2 is also applicable to within-camera tracking, here we show the specific case of across-camera track association. Let \(\mathcal {T}\) represent the set of tracks from all cameras and C the collection of track sets \(C^i_p\) generated by our algorithm. \(T^{\vartheta }_p\) typifies the \(\vartheta \)th track of camera p, and \(T_p\) contains all the tracks of camera p. The function \(\mathcal {F}(\mathcal {Q},\mathtt {A} \)) takes as input a constraint set \(\mathcal {Q}\) and the affinity \(\mathtt {A}\), and outputs all m local solutions \(\mathcal {X}^{n\times m}\) of program (1) that contain element(s) of the constraint set. This is accomplished by iteratively finding a local maximizer of program (1) in \(\varDelta \), e.g. using game dynamics, and then changing the constraint set \(\mathcal {Q}\), until all members of the constraint set have been clustered.

[Algorithm 2: Across-camera track association]
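A minimal sketch of this per-camera association loop, again reusing the fcdsc sketch above (names are ours; we return the track sets rather than the full solution matrix \(\mathcal {X}^{n\times m}\)):

```python
import numpy as np

def across_camera_association(A, tracks_per_camera):
    """A: global track affinity; tracks_per_camera[p] = node indices of camera p's tracks.
    Returns C with C[p] = list of track sets generated using camera p as constraint set."""
    C = {}
    for p, T_p in enumerate(tracks_per_camera):
        remaining = set(T_p)          # constraint set Q starts as all tracks of camera p
        C[p] = []
        while remaining:              # the role of F(Q, A): peel off solutions until
            x = fcdsc(A, remaining)   # every track of camera p has been clustered
            members = set(np.flatnonzero(x > 0))
            C[p].append(members)
            remaining -= members
    return C
```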

7 Track Refinement

During across-camera tracking, using the tracks of distinct cameras as different constraint sets might result in overlapping clusters (clusters with some common nodes), since we do not remove the nodes of extracted clusters from the graph at each iteration. Thus, we propose a refinement step to assign such ambiguous nodes (nodes assigned to multiple clusters) to their correct cluster. To achieve this, we employ the notion of centrality of fast-constrained dominant sets, along with the notion of reciprocal neighbors: if two nodes (from different cameras) are similar to each other, then their corresponding neighbors (the nodes in their respective clusters) are expected to be similar too.

Let us assume we have \(\mathcal {I}\) cameras; \(\mathcal {K}^i\) represents the collection of track sets that contain track i, while \(\mathcal {K}^i_p\) is the subset of \(\mathcal {K}^i\) generated using camera p as the constraint set. \(\mathcal {M}^{l^i}_p\) is the membership score assigned to the lth track in the set \(\mathcal {C}^i_p\).

We use two constraints during the track refinement stage, which help us prune false positive associations.

Constraint 1: A track cannot belong to two different sets generated using the same constraint set; i.e., it must hold that:

$$\begin{aligned} |\mathcal {K}_p^i| \le 1 \end{aligned}$$

Sets that do not satisfy the above inequality should be refined, as one or more tracks appear in different track sets collected using the same constraint set, \(T_p\). The corresponding track is removed from all sets which contain it and is assigned to the correct set based on its membership score in each of them. Suppose the lth track exists in q different sets when the tracks of camera p are taken as the constraint set, i.e. \(|\mathcal {K}^l_p|=q\). The correct set for the track, \(C^r_p\), is chosen as:

$$\begin{aligned} C^r_p = \mathop {{\arg \,\max }}\limits _{C^i_p \in \mathcal {K}^l_p} ~ \left( |C^i_p|*\mathcal {M}^{l^i}_p \right) . \end{aligned}$$

where \(i = 1, \dots , |\mathcal {K}^l_p| \). The score is normalized by the cardinality of the set to avoid a bias towards smaller sets.

Constraint 2: The maximum number of sets that contain track i should equal the number of cameras under consideration. For \(\mathcal {I}\) cameras, the cardinality of the collection of sets containing track i is not larger than \(\mathcal {I}\), i.e.:

$$\begin{aligned} |\mathcal {K}^i| \le \mathcal {I}. \end{aligned}$$

If there are sets that do not satisfy the above condition, the tracks are refined based on the cardinality of the intersection of the sets that contain the track, enforcing the reciprocal property of those sets. Assume we collect track sets considering the tracks of camera q as the constraint set, and assume track \(\vartheta \), belonging to the set \(C^j_p\) with \(p\ne q\), exists in more than one set of \(C_q\). The correct set \(C^r_q\) for \(\vartheta \), considering the tracks of camera q as the constraint set, is chosen as:

$$\begin{aligned} C^r_q = \mathop {{\arg \,\max }}\limits _{C^i_q \in \mathcal {K}^\vartheta _q} ~ \left| C^i_q \cap C^j_p \right| . \end{aligned}$$

where \(i = 1, \dots , |\mathcal {K}^\vartheta _q| \).
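A minimal sketch of the two refinement rules (names are ours; C_p is the list of track sets obtained with camera p as the constraint set, and M_p[k][track] the corresponding membership score):

```python
def refine_constraint1(C_p, M_p, track):
    """Constraint 1: keep `track` only in the set maximizing |C^i_p| * membership score."""
    holders = [k for k, S in enumerate(C_p) if track in S]
    if len(holders) <= 1:
        return
    best = max(holders, key=lambda k: len(C_p[k]) * M_p[k][track])
    for k in holders:
        if k != best:
            C_p[k].discard(track)     # remove the track from every non-winning set

def refine_constraint2(C_q, holders_q, C_jp):
    """Constraint 2 (reciprocal-neighbor rule): among the sets of C_q containing
    the track, pick the one with the largest intersection with C^j_p."""
    return max(holders_q, key=lambda k: len(C_q[k] & C_jp))
```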

Fig. 4

Camera topology for the DukeMTMC dataset. Detections from the overlapping fields of view of cameras (#2 and #8) and (#3 and #5) are not considered

8 Experimental Results

The proposed framework has been evaluated on the recently-released large-scale MOTchallenge DukeMTMC dataset (Ristani et al. 2016; Solera et al. 2016; Ristani and Tomasi 2014). Even though the main focus of this paper is multi-target tracking across multiple non-overlapping cameras, we also perform additional experiments on MARS (Zheng et al. 2016), one of the largest and most challenging video-based person re-identification datasets, to show that the proposed across-camera tracking approach can efficiently solve this task too.

DukeMTMC has recently been released to evaluate the performance of multi-target multi-camera tracking systems. It is the largest fully-annotated and calibrated dataset to date, with high-resolution 1080p, 60 fps video covering a single outdoor scene from 8 fixed synchronized cameras. The camera topology is shown in Fig. 4. The dataset consists of 8 videos of 85 min each, one per camera, with 2700 unique identities (IDs) over more than 2 million frames per video, and between 0 and 54 people per frame. The videos are split into three parts: (1) Trainval (the first 50 min), for training and validation; (2) Test-Hard (the 10 min following the Trainval sequence); and (3) Test-Easy (the last 25 min). Properties that make the dataset challenging include the huge amount of data to process, 4159 hand-overs, more than 1800 self-occlusions (with 50% or more overlap), and 891 people walking in front of only one camera.

MARS (Motion Analysis and Re-Identification Set) is an extension of the Market-1501 dataset (Zheng et al. 2016). It was collected from six near-synchronized cameras and contains 1261 different pedestrians, each captured by at least 2 cameras. The variations in pose, color and illumination, as well as the poor image quality, make it very difficult to achieve high matching accuracy. Moreover, the dataset contains 3248 distractors to make it more realistic. The deformable part model (DPM) (Felzenszwalb et al. 2010) and the GMMCP tracker (Dehghan et al. 2015) were used to automatically generate the tracklets (mostly 25–50 frames long). Since the videos and detections are not available, we used the provided tracklets as input to our framework.

Performance Measures: In addition to the standard multi-target multi-camera tracking performance measures, we evaluate our framework using measures recently proposed in Ristani et al. (2016): identification F-measure (IDF1), identification precision (IDP) and identification recall (IDR). Standard performance measures, such as CLEAR MOT, report the number of incorrect decisions made by a tracker. Ristani et al. (2016) argue and demonstrate that some system users may instead be more interested in how well they can determine who is where at all times. After pointing out that different measures serve different purposes, they propose the three measures (IDF1, IDP and IDR), which can be applied both within and across cameras. These measure a tracker's performance not by how often ID switches occur, but by how long the tracker correctly tracks targets.

Identification precision IDP (recall IDR) is the fraction of computed (ground truth) detections that are correctly identified.

Identification F-measure IDF1 is the ratio of correctly identified detections over the average number of ground-truth and computed detections.
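In terms of the identity true positives, false positives and false negatives (IDTP, IDFP, IDFN) of the optimal truth-to-result match defined in Ristani et al. (2016), these measures read:

$$\begin{aligned} \text {IDP} = \frac{\text {IDTP}}{\text {IDTP}+\text {IDFP}}, \qquad \text {IDR} = \frac{\text {IDTP}}{\text {IDTP}+\text {IDFN}}, \qquad \text {IDF1} = \frac{2\,\text {IDTP}}{2\,\text {IDTP}+\text {IDFP}+\text {IDFN}}. \end{aligned}$$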

Since MOTA and its related performance measures under-report across-camera errors (Ristani et al. 2016), we use them for the evaluation of our single camera tracking results.

The re-identification performance of the algorithm is evaluated using rank-1 accuracy and a confusion matrix based on average precision (AP).

Implementation: Our framework has no parameters that need tuning. The affinity matrix \(\mathtt {A}\) is constructed as follows, by adapting the kernel-trick distance function from Grossman et al. (2005):

$$\begin{aligned} \mathtt {A}_{i,j} = 1 - \sqrt{\frac{\mathtt {K}(x_i,x_i)+\mathtt {K}(x_j,x_j) - 2*\mathtt {K}(x_i,x_j)}{2}}, \end{aligned}$$

where \(\mathtt {K}(x_i,x_j)\) is chosen as the Laplacian kernel

$$\begin{aligned} \mathtt {K}(x_i,x_j) = \exp \left( -\gamma \parallel x_i - x_j \parallel _1\right) . \end{aligned}$$

The kernel parameter \(\gamma \) is set as the inverse of the median of pairwise distances.
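A minimal sketch of this affinity construction (our names; it assumes one feature vector per track, e.g. the L2-normalized max-pooled CNN features described below):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kernel_affinity(X):
    """X: (n, d) matrix, one feature vector per track.
    Returns the affinity A given by the two formulas above."""
    D1 = squareform(pdist(X, metric="cityblock"))            # pairwise L1 distances
    gamma = 1.0 / np.median(D1[np.triu_indices_from(D1, k=1)])
    K = np.exp(-gamma * D1)                                  # Laplacian kernel
    d = np.diag(K)                                           # K(x_i, x_i) = 1 here
    A = 1.0 - np.sqrt(np.maximum(d[:, None] + d[None, :] - 2 * K, 0) / 2)
    np.fill_diagonal(A, 0)    # assumption: no self-affinity, as usual in dominant sets
    return A
```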

The similarity matrix for the final layer of the framework is sparse: we use spatio–temporal information based on the transition time and the zones through which a person moves from one camera to another, learned from the Trainval sequence of the DukeMTMC dataset. The affinity between track i and track j is non-zero if and only if, based on the direction the person is moving and the spatio–temporal information, the two tracks can be linked to form a trajectory (the across-camera tracks of a person). However, this may fail for broken tracks, or for the track of a person who stands talking or doing other activities in one camera, resulting in a track that does not meet the spatio–temporal constraints. To deal with this problem, we add path-based information to the across-camera track similarity, as in Zemene and Pelillo (2015): if a track in camera i can be linked with a track in camera j, and the latter can in turn be linked with a track in camera z, then the tracks in cameras i and z are also considered linkable.
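A minimal sketch of this path-based extension (our names; L is a boolean linkability matrix derived from the learned spatio–temporal constraints):

```python
import numpy as np

def path_based_linkability(L, hops=2):
    """Extend pairwise linkability transitively: i-z become linkable whenever
    i-j and j-z are, up to the given number of hops."""
    R = L.copy()
    for _ in range(hops - 1):
        R = R | ((R.astype(int) @ L.astype(int)) > 0)   # one more hop of reachability
    return R
```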

The similarity between two tracks is computed using the Euclidean distance of max-pooled features. The max-pooled features are computed as the row-wise maximum of the feature vectors of the individual patches of a given track, extracted from the last fully-connected layer of an ImageNet pre-trained 50-layer Residual Network (ResNet-50) (He et al. 2016), fine-tuned on the Trainval sequence of the DukeMTMC dataset. The network is fine-tuned with a classification loss on the Trainval sequence, and the activations of its last fully-connected layer are extracted, L2-normalized and taken as visual features. Cross-view Quadratic Discriminant Analysis (XQDA) (Liao et al. 2015) is then used for pairwise distance computation between instances. For the experiments on MARS, the patch representation is obtained using the CNN features of Zheng et al. (2016). The pairwise distances between instances are then computed in the XQDA, KISSME (Köstinger et al. 2012) and Euclidean spaces.

8.1 Evaluation on DukeMTMC Dataset

In Tables 1 and 2, we compare the quantitative performance of our method with state-of-the-art multi-camera multi-target tracking methods on the DukeMTMC dataset. As the ground truth for the test set is not publicly available, we compare with approaches whose results are present on the scoring board. The symbol \(\uparrow \) means higher scores indicate better performance, while \(\downarrow \) means lower scores indicate better performance. The quantitative results in Table 1 represent the performance on the Test-Easy sequence, while those in Table 2 show the performance on the Test-Hard sequence. For a fair comparison, we use the same detection responses obtained from MOTchallenge DukeMTMC as input to our method. In both cases, the rows 'Camera 1' to 'Camera 8' report the within-camera tracking performance, and the last row of each table reports the average performance over the 8 cameras. Both sets of results demonstrate that the proposed approach improves tracking performance on both sequences. On the Test-Easy sequence, the performance improves by 11.5% in MOTA and 5.8% in IDF1, while on the Test-Hard sequence our method produces a 5% higher average MOTA score than Ristani et al. (2016), and a 0.4% improvement in IDF1 w.r.t. Maksai et al. (2017). Tables 3 and 4, respectively, present the multi-camera performance of our approach and of the state of the art (Ristani et al. 2016; Maksai et al. 2017) on the Test-Easy and Test-Hard sequences of the DukeMTMC dataset. We improve IDF1 on the Test-Easy and Test-Hard sequences by about 4% and 3%, respectively.

Table 1 The results show detailed (for each camera) and average performance of our and state-of-the-art approaches (Ristani et al. 2016; Maksai et al. 2017) on the Test-Easy sequence of DukeMTMC dataset
Table 2 The results show detailed (for each camera) and average performance of our and state-of-the-art approaches (Ristani et al. 2016; Maksai et al. 2017) on the Test-Hard sequence of DukeMTMC dataset

Figures 5, 6, 7 and 8 depict sample qualitative results. Each person is represented by two bounding boxes of the same color, which show the person's position at two specific times, and a track showing the path he or she follows. In Fig. 5, all four targets are successfully tracked across the four cameras in which they appear, even under significant illumination and pose changes. In Fig. 6, target 714 is successfully tracked through three cameras; note its significant illumination and pose changes from camera 5 to camera 7. Figure 7 shows targets moving through cameras 1, 6, 7 and 8. Finally, Fig. 8 shows sample tracks of targets that appear in cameras 1–4.

8.2 Evaluation on MARS Dataset

In this experiment, given the query (constraint), we first extract a cluster which is guaranteed to contain the query, together with gallery members that are highly similar to each other. We then use the notion of centrality of FCDSC, whereby a membership score (depicting proximity to the query) is assigned to each element of the cluster. These scores are then used to rank the extracted tracks. In Table 5 we compare our results (using the same settings as Zheng et al. (2016)) on the MARS dataset; the proposed approach achieves about 3% improvement. Table 6 reports the performance of our approach and of the state of the art (Zheng et al. 2016) in solving within-camera (average of the diagonal of the confusion matrix, Fig. 9) and across-camera (off-diagonal average) ReID using average precision. Our approach shows up to 10% improvement in across-camera ReID and up to 6% improvement in within-camera ReID.
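A minimal sketch of this query-constrained ranking (our names; appending the non-cluster gallery tracks by raw affinity to the query is our assumption, as the paper only specifies the ranking of cluster members):

```python
import numpy as np

def reid_rank(A, query, gallery):
    """Rank gallery tracks for one query: cluster members first, by FCDSC
    membership score, then the remaining gallery by affinity to the query."""
    x = fcdsc(A, {query})                  # cluster guaranteed to contain the query
    members = [i for i in gallery if x[i] > 0]
    members.sort(key=lambda i: -x[i])      # centrality = membership score
    rest = [i for i in gallery if x[i] == 0]
    rest.sort(key=lambda i: -A[query, i])  # assumption: fall back to pairwise affinity
    return members + rest
```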

Table 3 Multi-camera performance of our and state-of-the-art approaches (Ristani et al. 2016; Maksai et al. 2017) on the Test-Easy sequence of DukeMTMC dataset
Table 4 Multi-Camera performance of our and state-of-the-art approaches (Ristani et al. 2016; Maksai et al. 2017) on the Test-Hard sequence of DukeMTMC dataset
Fig. 5

Sample qualitative results of the proposed approach on the DukeMTMC dataset. All four targets [with IDs 216 (pink), 212 (light blue), 202 (blue) and 171 (red)], even under significant illumination and pose changes, are successfully tracked in the four cameras (cameras 1, 5, 2 and 6) in which they appear (best viewed in color and zoomed) (Color figure online)

Fig. 6

Target 714 (blue) is successfully tracked through cameras 5, 7 and 4. Note its significant illumination and pose changes from camera 5 to camera 7. Similarly, target 238 (red) is correctly tracked in cameras 2, 5 and 7 (best viewed in color and zoomed) (Color figure online)

Fig. 7

Regardless of the changes in pose and illumination, targets 1404 (light blue), 1162 (red) and 1360 (blue) are tracked correctly in cameras 6, 8 and 7. Similarly, targets 59 (pink) and 79 (green) are correctly tracked as they appear in cameras 1 and 8 (best viewed in color and zoomed) (Color figure online)

Fig. 8

The proposed approach is able to track target 219 (red), despite heavy changes in pose and illumination, correctly in all four cameras (3, 4, 1 and 2). Target 256 (blue) is correctly tracked in cameras 3, 1 and 2. Similarly, our method correctly identifies target 343 (light blue) as she appears in cameras 3, 4 and 2 (best viewed in color and zoomed) (Color figure online)

To show how meaningful the notion of centrality of fast-constrained dominant sets is, we conduct an experiment on the MARS dataset computing the final ranking using either the membership scores or the pairwise distances. The confusion matrix in Fig. 9 details both the within-camera (diagonal) and across-camera (off-diagonal) results, taking the tracks of each camera in turn as queries. Given a query, a set containing the query is extracted using the fast-constrained dominant set framework; note that fast-constrained dominant sets come with membership scores for all members of the extracted set. Figure 9 shows the results when the final ranking is obtained using the membership scores (left) and using the pairwise Euclidean distances between the query and the extracted nodes (right). As can be seen from the results in Table 6 (average performance), using the membership score outperforms the pairwise-distance approach, since it captures the interrelations among targets.

Table 5 The table shows the comparison (based on rank-1 accuracy) of our approach with state-of-the-art approaches on the MARS dataset
Table 6 The results show the performance of our approach, using pairwise distance (PD) and membership score (MS), and of the state-of-the-art approach (Zheng et al. 2016) in solving within- and across-camera ReID, using average precision on the MARS dataset with CNN features and different distance metrics

8.3 Computational Time

Figure 10 shows the time taken for each track—from 100 randomly selected (query) tracks of the MARS dataset—to be associated with the rest of the (gallery) tracks, after running CDSC (Zemene and Pelillo 2016) and the newly proposed FCDSC over the constructed graph. The vertical axis is the CPU time in seconds and the horizontal axis shows the track IDs. As is evident from the plot, our approach takes a fraction of a second (red points in Fig. 10), whereas CDSC takes up to 8 s in some cases (green points in Fig. 10). Figure 11 further illustrates how much faster the proposed approach is: the vertical axis reports the ratio of CDSC CPU time to FCDSC CPU time, which ranges from 2000 (FCDSC being 2000 \(\times \) faster than CDSC) to above 4500.

Fig. 9

The results show the performance of our algorithm on MARS (both using CNN \(+\) XQDA) when the final ranking is done using the membership score (left) and using the pairwise Euclidean distance (right)

Fig. 10

CPU time taken for each track association using our proposed approach (FCDSC) and CDSC (Color figure online)

Fig. 11

The ratio of CPU time taken between CDSC and proposed approach (FCDSC), computed as CPU time for CDSC/CPU time for FCDSC

Our non-optimized Matlab implementation, on a Windows machine with 60 GB of RAM and an Intel i7 3.1 GHz CPU, runs the whole tracking pipeline (excluding detection) at 18 fps.

9 Conclusions

In this paper we presented a novel fast-constrained dominant set clustering (FCDSC) framework for solving the multi-target tracking problem in multiple non-overlapping camera settings. The proposed method follows a three-layer hierarchical approach: within-camera tracking is solved using the first two layers, producing tracks for each person, and in the third layer the proposed across-camera tracker merges tracks of the same person across different cameras. Experiments on a challenging real-world dataset (MOTchallenge DukeMTMC) validate the effectiveness of our model.

We performed additional experiments to show the effectiveness of the proposed approach on one of the largest video-based person re-identification datasets (MARS). Here, each query is treated as a constraint set, and the members of its corresponding constrained dominant set cluster are considered as candidate matches for the query.

There are a few directions we would like to pursue in future research. In this work we considered static cameras with a known topology (the layout of the scene, including the positions of the cameras); it would be important for the approach to handle more challenging scenarios, where some views come from cameras with ego motion (e.g., PTZ cameras or mobile devices) and the camera topology is unknown. Moreover, we considered features from static images; we believe that video features, for example extracted using an LSTM, could boost performance and help extend the method to such challenging scenarios.