Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Multiple people tracking is a key problem for many computer vision tasks, such as surveillance, animation or activity recognition. In crowded environments occlusions and false detections are common, and although there have been substantial advances in the last years, tracking is still a challenging task. Tracking is often divided in two steps: detection, finding the objects of interest on every frame, and data association, matching the detections to form complete trajectories in time. Researchers have presented improvements on the object detector [9, 13, 34, 36] as well as on the optimization techniques [18, 23] and even specific algorithms have been developed for tracking in crowded scenes [2, 32]. Though each object can be tracked separately, recent works have proven that tracking objects jointly and taking into consideration their interaction can give much better results in complex scenes. Current research is mainly focused on two aspects to exploit the interaction between pedestrians: the use of a global optimization strategy [7, 21, 40] and a social motion model [30, 38]. The focus of this chapter is to give a detailed overview of multiple people trackers which include either a global optimization method or social behavior information to improve tracking results in crowded scenarios. Finally, the chapter discusses an approach to marry both concepts and include the social behaviors in a global optimization tracking system (Fig. 11.1).

Fig. 11.1
figure 1

Terms of the social force model. (a) Constant velocity assumption. (b) Avoidance forces. (c) Group attraction forces

1.1 Related Work

Current research is mainly focused on two aspects to exploit the interaction between pedestrians: the use of a global optimization strategy and a social motion model. In this section, we discuss both research trends.

Global Optimization: The optimization strategy deals with the data association problem, which is usually solved on a frame-by-frame basis or one track at a time. Several methods can be used such as Markov Chain Monte Carlo (MCMC) [19], multi-level Hungarian [20], inference in Bayesian networks [27] or the Nash Equilibrium of game theory [39]. In [6] an efficient approximative Dynamic Programming (DP) scheme is presented, in which trajectories are estimated one after the other. This means that if a trajectory is formed using a certain detection, the other trajectories which are computed later will not be able to use that detection anymore. This obviously does not guarantee a global optimum for all trajectories. Recent works show that global optimization can be more reliable in crowded scenes as it solves the matching problem jointly for all tracks. The multiple object tracking problem is defined as a linear constrained optimization flow problem and Linear Programming (LP) is commonly used to find the global optimum. The idea was first used for people tracking in [16], although this method needs to know a priori the number of targets to track, which limits its application in real tracking situations. In [7], the scene is divided into identical cells, each represented by a node in the constructed graph. Using the information of the Probability Occupancy Map, the problem is formulated either as a max-flow and solved with Simplex, or as a min-cost and solved using k-shortest paths, which is a more efficient solution. Both methods show a far superior performance when compared to the same approach with DP [6]. The authors of [3] also define the problem as a maximum flow on an hexagonal grid, but instead of using matching individual detections, they make use of tracklets. This has the advantage that they can precompute the social forces for each of these tracklets, nonetheless, the fact that the tracklets are chosen locally, means the overall matching is not truly global, and if errors occur during the creation of the tracklets, these cannot be overcome by the global optimization. In [40] the tracking problem is formulated as a Maximum A-Posteriori (MAP) problem, which is mapped to a minimum-cost network flow and then efficiently solved using LP. In this case, each node represents a detection, which means the graph is much smaller compared to [3, 7]. Finally, [37] propose to combine global and local methods to match trajectories across cameras and across time, while a unique global formulation for the multi-view multi-object is presented in [22].

Social Behavior for Tracking: Most tracking systems work with the assumption that the motion model for each target is independent. This simplifying assumption is especially problematic in crowded scenes: imagine the chaos if every pedestrian followed his or her chosen path and completely ignored the other pedestrians in the scene. In order to avoid collisions and reach the chosen destination at the same time, a pedestrian follows a series of social rules or social forces. These have been defined in what is called the Social Force Model (SFM) [15], which has been used for abnormal crowd behavior detection [26], crowd simulation [28] and has only recently been applied to multiple people tracking: in [33], an energy minimization approach is used to predict the future position of each pedestrian considering all the terms of the social force model. In [30] and [24], the social forces are included in the motion model of the Kalman or Extended Kalman filter, while the authors in [4] discuss the type of energy needed to include information about the dynamic model, repulsion, etc. and how to optimize it using the standard conjugate gradient method. In [14] a method is presented to detect small groups of people in a crowd, but it is only recently that grouping behavior has been included in a tracking framework [10, 29, 38]. In [29] groups are included in a graphical model which contains cycles and, therefore, Dual Decomposition [8] is needed to find the solution, which obviously is computationally much more expensive than using Linear Programming. Moreover, the results presented in [29] are only for short time windows. On the other hand, the formulations of [10, 38] are predictive by nature and therefore too local and unable to deal with trajectory changes (e.g. when people meet and stop to talk).

Recently, a new approach [21] includes social and grouping models into a global optimization framework, allowing for a better estimate of the true maximum a-posteriori probability of the trajectories and therefore further improving tracking results, especially in crowded scenes.

2 Multiple People Tracking

Tracking is commonly divided in two steps: object detection and data association. First, the objects are detected in each frame of the sequence and second, the detections are matched to form complete trajectories. In this section we define the data association problem and describe how to convert it to a minimum-cost network flow problem, which can be efficiently solved using Linear Programming.

The idea is to build a graph in which the nodes represent the pedestrian detections. These nodes are fully connected to past and future observations by edges, which determine the relation between two observations with a cost. Thereby, the matching problem is equivalent to a minimum-cost network flow problem: finding the optimal set of trajectories is equivalent to sending flow through the graph so as to minimize the cost. This can be efficiently computed using the Simplex algorithm or k-shortest paths [11].

2.1 Problem Statement

Let \(\mathcal{O} =\{ \mathbf{o}_{k}t\}\) be a set of object detections with \(\mathbf{o}_{k}^{t} = (\mathbf{p}_{k},t)\), where p k  = (x, y, z) is the 3D position and t is the time stamp. A trajectory is defined as a list of ordered object detections \(T_{k} =\{ \mathbf{o}_{k}^{1},\mathbf{o}_{k}^{2},\cdots \,,\mathbf{o}_{k}^{N}\}\), and the goal of multiple object tracking is to find the set of trajectories \(\mathcal{T}{\ast} =\{ T_{k}\}\) that best explains the detections.

This is equivalent to maximizing the a-posteriori probability of \(\mathcal{T}\) given the set of detections \(\mathcal{O}\), which is known as maximum posterior or MAP problem.

$$\displaystyle\begin{array}{rcl} \mathcal{T}{\ast} =\mathop{\mathop{ \mathbf{argmax}}\nolimits }\limits_{\mathcal{T}}\ P(\mathcal{T}\vert \mathcal{O})& &{}\end{array}$$
(11.1)

Further assuming that detections are conditionally independent, the objective function is expressed as:

$$\displaystyle\begin{array}{rcl} \mathcal{T}{\ast} =\mathop{\mathop{ \mathbf{argmax}}\nolimits }\limits_{\mathcal{T}}\ P(\mathcal{O}\vert \mathcal{T} )P(\mathcal{T} ) =\mathop{\mathop{ \mathbf{argmax}}\nolimits }\limits_{\mathcal{T}}\ \prod \limits _{k}P(\mathbf{o}_{k}\vert \mathcal{T} )P(\mathcal{T} )& &{}\end{array}$$
(11.2)

\(P(\mathbf{o}_{k}\vert \mathcal{T} )\) is the likelihood of the detection. Optimizing Eq. (11.2) directly is intractable since the space of \(\mathcal{T}\) is huge, nonetheless we make the assumption that the trajectories cannot overlap (i.e., a detection cannot belong to two trajectories) to obtain:

$$\displaystyle\begin{array}{rcl} \mathcal{T}{\ast} =\mathop{\mathop{ \mathbf{argmax}}\nolimits }\limits_{\mathcal{T}}\prod \limits _{k}P(\mathbf{o}_{k}\vert \mathcal{T} )\prod _{T_{k}\in \mathcal{T}}P(T_{k})& &{}\end{array}$$
(11.3)

where the trajectories are represented by a Markov chain:

$$\displaystyle\begin{array}{rcl} P(\mathcal{T} ) =\prod \limits _{T_{k}\in \mathcal{T}}P_{\mathrm{in}}(\mathbf{o}_{k}^{1})P(\mathbf{o}_{ k}^{2}\vert \mathbf{o}_{ k}^{1})\ldots P(\mathbf{o}_{ k}^{t}\vert \mathbf{o}_{ k}^{t-1})\ldots P(\mathbf{o}_{ k}^{N}\vert \mathbf{o}_{ k}^{N-1})P_{\mathrm{ out}}(\mathbf{o}_{k}^{N})& &{}\end{array}$$
(11.4)

where \(P_{\mathrm{in}}(\mathbf{o}_{k}^{t})\) is the probability that a trajectory is initiated with detection o k t, \(P_{\mathrm{out}}(\mathbf{o}_{k}^{t})\) the probability that the trajectory is terminated at o k t and \(P(\mathbf{o}_{k}^{t}\vert \mathbf{o}_{k}^{t-1})\) is the probability that o k t−1 is followed by o k t in the trajectory.

2.2 Tracking with Linear Programming

In this section, we explain how to convert the MAP problem into a Linear Program, which is a particularly interesting since it can be efficiently solved in polynomial time using any of the available techniques from the optimization community [1].

A linear programming problem consists in minimizing or maximizing a linear function in the presence of linear constraints, which can be both equalities and inequalities.

$$\displaystyle\begin{array}{rcl} \mathrm{Minimize}\quad& c_{1}f_{1} + c_{2}f_{2} +\ldots +c_{n}f_{n}&{}\end{array}$$
(11.5)
$$\displaystyle\begin{array}{rcl} & & \mathrm{Subjectto}\quad a_{11}f_{1} + a_{12}f_{2} +\ldots +a_{1n}f_{n} \geq b_{1} \\ & & ~~~~~~\quad\quad\quad a_{21}f_{1} + a_{22}f_{2} +\ldots +a_{2n}f_{n} \geq b_{2} \\ & & ~~~~~~\quad\quad\quad\vdots\quad\quad\vdots\quad\quad\quad\vdots \\ & & ~~~~~~\quad\quad\quad a_{m1}f_{1} + a_{m2}f_{2} +\ldots +a_{mn}f_{n} \geq b_{m} {}\end{array}$$
(11.6)

where Eq. (11.5) is the objective function and Eq. (11.6) are the constraints. \(c_{1},c_{2},\ldots,c_{n}\) denote the known cost coefficients and \(f_{1},f_{2},\ldots,f_{n}\) are the decision variables to be determined.

To convert our problem into a linear program, we linearize the objective function by defining a set of flow flags \(f_{i,j} =\{ 0,1\}\) which indicate if an edge (i, j) is in the path of a trajectory or not.

In a minimum cost network flow problem, the objective is to find the values of the variables that minimize the total cost of the flows over the network. Defining the costs as negative log-likelihoods, and combining Eqs. (11.3) and (11.4), the following objective function is obtained:

$$\displaystyle\begin{array}{rcl} \mathcal{T}{\ast}& =& \mathop{\mathop{\mathbf{argmin}}\nolimits }\limits_{\mathcal{T}}\sum _{T_{k}\in \mathcal{T}}-\log P(T_{k}) -\sum _{k}\log P(\mathbf{o}_{k}\vert \mathcal{T} ) \\ & =& \mathop{\mathop{\mathbf{argmin}}\nolimits }\limits_{\mathcal{T}}\sum _{i}C_{\mathrm{in},i}f_{\mathrm{in},i} +\sum _{i,j}C_{i,j}f_{i,j} +\sum _{i}C_{i}f_{i} +\sum _{i}C_{i,\mathrm{out}}f_{i,\mathrm{out}}{}\end{array}$$
(11.7)

subject to the following constraints:

  • Edge capacities: assuming each detection can only correspond to one trajectory, the edge capacities have an upper bound of u ij  ≤ 1 and:

    $$\displaystyle\begin{array}{rcl} f_{\mathrm{in},i} + f_{i} \leq 1\qquad f_{i,\mathrm{out}} + f_{i} \leq 1& & {}\end{array}$$
    (11.8)
  • Flow conservation at the nodes:

    $$\displaystyle\begin{array}{rcl} f_{\mathrm{in},i} + f_{i} =\sum \limits _{j}f_{i,j}\qquad \sum \limits _{j}f_{j,i} = f_{i,\mathrm{out}} + f_{i}& & {}\end{array}$$
    (11.9)
  • Exclusion property:

    $$\displaystyle\begin{array}{rcl} f_{i,j} =\{ 0,1\}& & {}\end{array}$$
    (11.10)

The condition in Eq. (11.10) requires us to solve an integer program, which is known to be NP-complete. Nonetheless, we can relax the condition to have the following linear equation:

$$\displaystyle\begin{array}{rcl} 0 \leq f_{i,j} \leq 1.& &{}\end{array}$$
(11.11)

Now the problem is defined and can be solved as a linear program. If certain conditions are fulfilled, the solution \(\mathcal{T}{\ast}\) will still be integer, and therefore will also be the optimal solution to the initial integer program. We discuss the integrality of the solution in more detail in Sect. 11.4.

Fig. 11.2
figure 2

Example of a graph with the special source s and sink t nodes, six detections which are represented by two nodes each: the beginning b i and the end e i

To map this formulation into a cost-flow network, we define G = (N, E) to be a directed network with a cost C i, j and a capacity u ij associated with every edge (i, j) ∈ E. An example of such a network is shown in Fig. 11.2; it contains two special nodes, the source s and the sink t; all flow that goes through the graph starts at the s node and ends at the t node. Thereby, each flow represents a trajectory T k and the path that each flow follows indicates which observations belong to each of the trajectories. Each observation o i is represented with two nodes, the beginning node b i  ∈ N and the end node e i  ∈ N (see Fig. 11.2). A detection edge connects b i and e i .

Below we detail the three types of edges present in the graphical model and the cost for each type:

Link Edges: The edges (e i , b j ) connect the end nodes e i with the beginning nodes b j in following frames, with cost C i, j and flow f i, j , defined as:

$$\displaystyle\begin{array}{rcl} f_{i,j} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mathbf{o}_{i}\mbox{ and }\mathbf{o}_{j}\mbox{ belong to }T_{k}\mbox{ and }\varDelta f \leq F_{\mathrm{max}} \\ 0,\quad &\mbox{ otherwise} \end{array} \right.& &{}\end{array}$$
(11.12)

where Δ f is the frame number difference between nodes j and i and F max is the maximum allowed frame gap.

The costs of the link edges represent the spatial relation between different subjects. Assuming that a subject cannot move a lot from one frame to the next, we define the costs to be a decreasing function of the distance between detections in successive frames. The time gap between observations is also taken into account in order to be able to work at any frame rate, therefore velocity measures are used instead of distances. The velocities are mapped to probabilities with a Gauss error function as shown in Eq. (11.13), assuming the pedestrians cannot exceed a maximum velocity V max. The effect of parameter V max is detailed in Sect. 11.5.1.

$$\displaystyle\begin{array}{rcl} E(V _{t},V _{\mathrm{max}}) = \frac{1} {2} + \frac{1} {2}\mathrm{erf}\left (\frac{-V _{t} + \frac{V _{\mathrm{max}}} {2} } {\frac{V _{\mathrm{max}}} {4} } \right )& &{}\end{array}$$
(11.13)
Fig. 11.3
figure 3

(Color online) Blue = normalized histogram of speeds learned from training data. Red = probability distribution if cost depends linearly on the velocity. Green = probability distribution if the relation of cost and velocities is expressed by Eq. (11.13). An \(V _{\mathrm{max}} = 7\,\mathrm{m/s}\) is used in the experiments

As we can see in Fig. 11.3, the advantage of using Eq. (11.13) over a linear function is that the probability of lower velocities decreases more slowly, while the probability for higher velocities decreases more rapidly. This is consistent with the probability distribution of speed learned from training data.

Therefore, the cost of a link edge is defined as:

$$\displaystyle\begin{array}{rcl} C_{i,j}& =& -\log \left (P(\mathbf{o}_{j}\vert \mathbf{o}_{i})\right ) + C(\varDelta f) \\ & =& -\log E\left (\frac{\|\mathbf{p}_{j}-\mathbf{p}_{i})\|} {\varDelta t},V _{\mathrm{max}}\right ) + C(\varDelta f){}\end{array}$$
(11.14)

where \(C(\varDelta f) = -\log \left (B_{j}^{\varDelta f-1}\right )\) is the cost depending on the frame difference between detections.

Detection Edges: The edges (b i , e i ) connect the beginning node b i and end node e i , with cost C i and flow f i , defined as:

$$\displaystyle\begin{array}{rcl} f_{i} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mathbf{o}_{i}\mbox{ belongs to }T_{k} \\ 0,\quad &\mbox{ otherwise} \end{array} \right.& &{}\end{array}$$
(11.15)

If all the costs of the edges are positive, the solution to the minimum-cost problem is the trivial null flow. Consequently, we represent each observation with two nodes and a detection edge with negative cost:

$$\displaystyle\begin{array}{rcl} C_{i} =\log \left (1 - P_{det}(\mathbf{o}_{i})\right ) +\log \left ( \frac{\mathrm{BB}_{\mathrm{min}}} {\|\mathbf{p}_{\mathrm{BB}} -\mathbf{p}_{i})\|}\right ).& &{}\end{array}$$
(11.16)

The higher the likelihood of a detection P det (o i ) the more negative the cost of the detection edge, hence, confident detections are likely to be in the path of the flow in order to minimize the total cost. If a map of the scene is available, we can also include this information in the detection cost. If a detection is far away from a possible entry/exit point, we add an extra negative cost to the detection edge, in order to favor that observation to be matched. The added cost depends on the distance to the closest entry/exit point p BB, and is only computed for distances higher than BBmin = 1. 5m. This is a probabilistic simple way of including other information present in the scene, such as obstacles or attraction points (shops, doors, etc.).

Entrance and Exit Edges: The edges (s, e i ) connect the source s with all the end nodes e i , with cost C in, i and flow f in, i . Similarly, (b i , t) connects the end node b i with sink t, with cost C i, out and flow f i, out. The flows are defined as:

$$\displaystyle\begin{array}{rcl} f_{\mathrm{in},i}\mbox{ (or }f_{i,\mathrm{out}}\mbox{ )} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &T_{k}\mbox{ starts (or ends) at }\mathbf{o}_{i} \\ 0,\quad &\mbox{ otherwise} \end{array} \right.& &{}\end{array}$$
(11.17)

This connection, as shown in Fig. 11.4b, was proposed in [21] so that when a track starts (or ends) it does not benefit from the negative cost of the detection edge. Setting \(C_{\mathrm{in}} = C_{\mathrm{out}} = 0\) and taking into account the flow constraints of Eqs. (11.8) and (11.9), the trajectories are only created with the information of the link edges.

In contrast, the authors in [40] propose to create the opposite edges (s, b i ) and (e i , t), which means tracks entering and leaving the scene go through the detection node and therefore benefiting from its negative cost (see Fig. 11.4a). If the costs C in and C out are then set to zero, a track will be started at each detection of each frame, because it will be cheaper to use the entrance and exit edges than the link edges. On the other hand, if C in and C out are very high, it will be hard for the graph to create any trajectory. Therefore, the choice of these two costs is extremely important. In [40], the costs are set according to the entrance and exit probabilities P in and P out, which are data dependent terms that need to be calculated during optimization.

Fig. 11.4
figure 4

(a) Graph structure as used in [40], which requires the computation of P in and P out in an Expectation-Maximization step during optimization. (b) Graph structure as used in [21] which does not require the computation of these two parameters; the trajectories are found only with the information of the link and detection edges

3 Modeling Social Behavior

If a pedestrian does not encounter any obstacles, the natural path to follow is a straight line. But what happens when the space gets more and more crowded and the pedestrian can no longer follow the straight path? Social interaction between pedestrians is especially important when the environment is crowded. In this section we consider how to include the social behavior [15], which we divide into the Social Force Model (SFM) and the Group behavior (GR), into the minimum-cost network flow problem.

3.1 New MAP and Linear Programming Formulation

The original social force model [15] describes a physical system that predicts the position of a pedestrian in a continuous way, which has been successfully used for crowd simulation [28]. Nonetheless, we use the social information within another paradigm: in our Linear Programming system, we have a set of hypothetical pedestrian positions (in the form of nodes) and we apply the social forces to find out the probability of a certain match (i.e. a certain trajectory being followed by a pedestrian).

When including social and grouping information in the Linear Programming formulation, we can no longer assume that the motion of each subject is independent, which means we have to deal with a much larger search space of \(\mathcal{T}\).

We extend this space by including the following dependencies for each trajectory T k :

  • Constant velocity assumption: the observation \(\mathbf{o}_{k}^{t} \in T_{k}\) depends on past observations \([\mathbf{o}_{k}^{t-1},\mathbf{o}_{k}^{t-2}]\)

  • Grouping behavior: If T k belongs to a group, the set of members of the group \(\mathcal{T}_{k,\mathrm{GR}}\) has an influence on T k

  • Avoidance term: T k is affected by the set of trajectories \(\mathcal{T}_{k,\mathrm{SFM}}\) which are close to T k at some point in time and do not belong to the same group as T k

The first and third dependencies are grouped into the SFM term. The sets \(\mathcal{T}_{k,\mathrm{SFM}}\) and \(\mathcal{T}_{k,\mathrm{GR}}\) are disjoint, i.e., for a certain pedestrian k, the set of pedestrians that have an attractive effect (the group to which pedestrian k belongs to), is different from the pedestrians that have a repulsive effect on k. Therefore, we can assume that these two terms are independent and decompose \(P(\mathcal{T} )\) as:

$$\displaystyle\begin{array}{rcl} P(\mathcal{T} )& =& \prod _{T_{k}\in \mathcal{T}}P(T_{k} \cap \mathcal{T}_{k,\mathrm{SFM}} \cap \mathcal{T}_{k,\mathrm{GR}}) \\ & =& \prod _{T_{k}\in \mathcal{T}}P(\mathcal{T}_{k,\mathrm{SFM}}\vert T_{k})P(\mathcal{T}_{k,\mathrm{GR}}\vert T_{k})P(T_{k}){}\end{array}$$
(11.18)

Let us assume that we are analyzing observation o k t. In Fig. 11.5 we summarize which observations influence the matching of o k t. Typical approaches [40] only take into account distance (DIST) information, that is, the observation in the previous frame \(\mathbf{o}_{k}^{t-1}\). We introduce the social dependencies (SFM) given by the constant velocity assumption (green nodes) and the avoidance term (yellow nodes). In this case, two observations, o q t and o r t that do not belong to the same group as \(\mathbf{o}_{k}^{t}\), will be considered to create a repulsion effect on o k t. On the other hand, the orange nodes which depict the grouping term (GR), are two other observations o m t and o n t which do belong to the same group as o k t and therefore have an attraction effect on o k t. Note that all these dependencies can only be modeled by high order terms, which means that either we use complex solvers [29] to find a solution in graphs with cycles, or we keep the linearity of the problem by using an iterative approach as we explain later on.

Fig. 11.5
figure 5

Diagram of the dependencies for each observation o k t

The objective function is accordingly updated:

$$\displaystyle\begin{array}{rcl} \mathcal{T}{\ast}& =& \mathop{\mathop{\mathbf{argmin}}\nolimits }\limits_{\mathcal{T}}\sum _{T_{k}\in \mathcal{T}}-\log P(T_{k}) -\log P(\mathcal{T}_{\mathrm{SFM}}\vert T_{k}) \\ & & \quad -\log P(\mathcal{T}_{\mathrm{GR}}\vert T_{k}) +\sum _{k} -\log P(\mathbf{o}_{k}\vert \mathcal{T} ) \\ & =& \mathop{\mathop{\mathbf{argmin}}\nolimits }\limits_{\mathcal{T}}\sum _{i}C_{\mathrm{in},i}f_{\mathrm{in},i} +\sum _{i}C_{i,\mathrm{out}}f_{i,\mathrm{out}} \\ & & \quad +\sum _{i,j}(C_{i,j} + C_{\mathrm{SFM},i,j} + C_{\mathrm{GR},i,j})f_{i,j} +\sum _{i}C_{i}f_{i}{}\end{array}$$
(11.19)

3.2 Social Force Model

The social force model states that the motion of a pedestrian can be described as if they were subject to “social forces”. There are three main terms that need to be considered: the desire of a pedestrian to maintain a certain speed, the desire to keep a comfortable distance from other pedestrians and the desire to reach a destination. Since we cannot know a priori the destination of the pedestrian in a real tracking system, we focus on the first two terms.

Constant Velocity Assumption: The pedestrian tries to keep a certain speed and direction, therefore we assume that in t +Δ t we have the same speed as in t and predict the pedestrian’s position in t +Δ t accordingly.

$$\displaystyle{ \mathbf{\tilde{p}}_{i}^{t+\varDelta t} = \mathbf{p}_{ i}^{t} + \mathbf{v}_{ i}^{t}\varDelta t }$$

Avoidance Term: The pedestrian also tries to avoid collisions and keep a comfortable distance from other pedestrians. This term is modeled as a repulsion field with an exponential distance-decay function with value α learned from training data.

$$\displaystyle\begin{array}{rcl} \mathbf{a}_{i}^{t+\varDelta t} =\sum \limits _{ g_{m}\neq g_{i}}\exp \left (-\frac{\|\mathbf{\tilde{p}}_{i}^{t+\varDelta t} -\mathbf{\tilde{p}}_{m}^{t+\varDelta t}\|} {\alpha \varDelta t} \right )& &{}\end{array}$$
(11.20)

To compute the cost of edge (i, j), the constant velocity assumption is used to predict the position of o i and o j as well as the rest of pedestrians \(\mathbf{\tilde{p}}_{m}^{t+\varDelta t}\), and the repulsion acceleration each pedestrian has on i is also taken into account. The only pedestrians that have this repulsion effect on subject i are the ones which do not belong to the same group as i and \(\|\mathbf{\tilde{p}}_{i}^{t+\varDelta t} -\mathbf{\tilde{p}}_{m}^{t+\varDelta t}\| \leq 1m\). The different avoidance terms are combined linearly.

Now the prediction of the pedestrian’s next position is also influenced by the avoidance term (acceleration) from all pedestrians:

$$\displaystyle\begin{array}{rcl} \mathbf{\tilde{p}}_{i}^{t+\varDelta t} = \mathbf{p}_{ i}^{t} + (\mathbf{v}_{ i}^{t} + \mathbf{a}_{ i}^{t+\varDelta t}\varDelta t)\varDelta t& &{}\end{array}$$
(11.21)

The distance between prediction and real measurements is used to compute the cost:

$$\displaystyle\begin{array}{rcl} C_{\mathrm{SFM},i,j} = -\log E\left (\frac{\|\mathbf{\tilde{p}}_{i}^{t+\varDelta t} -\mathbf{p}_{j}^{t+\varDelta t}\|} {\varDelta t},V _{\mathrm{max}}\right )& &{}\end{array}$$
(11.22)

where the function E is detailed in Eq. (11.13).

In Fig. 11.6 we plot the probability distribution computed using different terms. Note, this is just for visualization purposes, since we do not compute the probability for each point on the scene, but only for the positions where the detector has fired. There are four pedestrians in the scene, the purple one and three green ones walking in a group. As shown in Fig. 11.6b, if we only use the predicted positions (yellow heads) given the previous speeds, there is a collision between the purple pedestrian and the green marked with a 1 collide. The avoidance term shifts the probability mode to a more plausible position.

Fig. 11.6
figure 6

(Color online) Three green pedestrians walk in a group, the predicted positions in the next frame are marked by yellow heads. The purple pedestrian’s linearly predicted position (yellow head) clearly interferes with the trajectory of the group. Representation of the probability (blue is 0 red is 1) distribution for the purple’s next position using: (a) only distances, (b) only SFM (constant velocity assumption and avoidance term), (c) only GR (considering the purple pedestrian belongs to the group), (d) distances+SFM and (e) distances+SFM+GR

3.3 Group Model

The social behavior [15] also includes an attraction force which occurs when a pedestrian is attracted to a friend, shop, etc. In this section, we show how to model the attraction between members of a group. Before modeling group behavior we need to determine which tracks form each group and at which frame the group begins and ends (to deal with splitting and formation of groups). The idea is that if two pedestrians are close to each other over a reasonable period of time, they are likely to belong to the same group. From the training sequence in [30], the distance and speed probability distributions of the members of a group P g vs. individual pedestrians P i is learned. If m and n are two trajectories which appear on the scene at t = [0, N], we compute the flag G m, n that indicates if m and n belong to the same group.

$$\displaystyle\begin{array}{rcl} G_{m,n} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\sum \limits _{t=0}^{N}P_{g}(m,n)>\sum \limits _{ t=0}^{N}P_{i}(m,n) \\ 0,\quad &\mbox{ otherwise } \end{array} \right.& &{}\end{array}$$
(11.23)

For every observation o i , we define a group label g i which indicates to which group the observation belongs to, if any. If several pedestrians form a group, they tend to keep a similar speed, therefore, if i belongs to a group, we can use the mean speed of all the other members of the group to predict the next position for i:

$$\displaystyle\begin{array}{rcl} \mathbf{\tilde{p}}_{i}^{t+\varDelta t} = \mathbf{p}_{ i}^{t} +\sum \limits _{ g_{m}=g_{i}}\mathbf{v}_{m}^{t}\varDelta t& &{}\end{array}$$
(11.24)

The distance between this predicted position and the real measurements is used in (11.13) to obtain the cost for the grouping term.

An example is shown in Fig. 11.6c, where we can see that the maximum probability provided by the group term keeps the group configuration. In Fig. 11.6d we show the combined probability of the distance and SFM information, which narrows the space of probable positions. Finally, Fig. 11.6e represents the combined probability of DIST, SFM and GR. As we can see, the space of possible locations for the purple pedestrian is considerably reduced as we add the social and grouping behaviors, which means we have less ambiguities for data association. This is specially useful to decrease identity switches as we present in Sect. 11.5.

4 Optimization

To compute the SFM and grouping costs, we need to have information about the velocities of the pedestrians, which can only be obtained if we already have the trajectories. This chicken-and-egg problem is solved iteratively as shown in Algorithm 3; on the first iteration, the trajectories are estimated only with the information defined in Sect. 11.2.2, for the rest of iterations, the SFM and GR is also used. The algorithm stops when the trajectories do not change or when a maximum number of iterations M i is reached.

4.1 Linear Programming Solvers

The minimum cost solution is found using the Simplex algorithm [11], with the implementation given in [25]. Though Simplex has an exponential worst-case complexity, most sequences can be tracked in just a few seconds; this is because each node represents one detection, and therefore the dimension of the graph is quite small. For larger graphs [7] or more crowded environments, we can use the k-shortest paths solver [7, 31] which has a worst case complexity of \(O(k(m + n \cdot \mathrm{ log}(n)))\). For more details on network flows and Simplex we refer the reader to [1], and to [35] for more information on the k-shortest path algorithm.

4.2 Integrality of the Solution

When defining the program to be solved, we saw that Eq. (11.10) defined an integer program, which is known to be NP-complete. The condition is relaxed into Eq. (11.11) in order to use efficient Linear Programming solvers to find the optimum solution to our problem. If the solution to the relaxed version of the program is integer, then we know it is an optimal solution of the original problem [1]. The question is, can we guarantee that the solution will be always integer?

Let us assume the conditions of the Linear Program are expressed as: Ax = b. If all entries of A and b are integer, as it is our case, we can determine that Ax = b has an integer solution by Cramer’s rule:

$$\displaystyle\begin{array}{rcl} Ax = b\quad \Leftrightarrow \quad x = {A}^{-1}b\quad \Leftrightarrow \quad \forall i: x_{ i} = \frac{\mathrm{det}({A}^{i})} {\mathrm{det}(A)} & &{}\end{array}$$
(11.25)

where A i is equal to A except on the i-th column where it is equal to b. From here, we can determine that x will be integer when det(A) is equal to +1 or −1. A matrix A ∈ Z m×n is totally unimodular if the determinant of all the subsquare matrices of A is either 0, +1 or −1.

Theorem 11.1.

If A is totally unimodular, every vertex solution of Ax ≤ b is integer.

A well-known case of totally unimodular matrices are the node arc incidence matrices N of a directed network. Therefore, our defined constraint matrix is totally unimodular, and the solutions we will obtain will always be integer.

4.3 Computationally Reduction

To reduce the computational cost, the graph can be pruned using the physical constraints represented by the edge costs. If any of the costs C ij , C SFM, i, j or C GR, i, j is infinite, the two detections i and j are either two far away to belong to the same trajectory or they do not match according to social and grouping rules, therefore the edge (i, j) is erased from the graphical model. For long sequences, the video can be divided into several batches and optimize for each batch. For temporal consistency, the batches have an overlap of F max = 10 frames. The runtime of [21] for a sequence of 800 frames (114 s), 4,837 detections, batches of 100 frames and 6 iterations is 30 s on a 3 GHz machine.

5 Experimental Results

In this section we show the tracking results of several state-of-the-art methods on three publicly available datasets and compare them using the CLEAR metrics [17], which split the measuring scores into accuracy and precision:

  • Detection Accuracy (DA): measures how many detections where correctly found and therefore is based on the count of missed detections m t and false alarms f t for each frame t.

    $$\displaystyle\begin{array}{rcl} DA = 1 -\frac{\sum _{t=1}^{N_{f}}m_{ t} + f_{t}} {\sum _{t=1}^{N_{f}}N_{G}^{t}} & & {}\\ \end{array}$$

    where N f is the number of frames of the sequence and N G t is the number of ground truth detections in frame t. A detection is considered to be correct when it is found within 50 pixels from the ground truth and the bounding boxes of both ground truth and detection have some overlap.

  • Tracking Accuracy (TA): similar to DA but also including the identity switches i t . In this case, the measure does not penalize identity switches as much as a missing detection or a false alarm as we use a log10 weight.

    $$\displaystyle\begin{array}{rcl} DA = 1 -\frac{\sum _{t=1}^{N_{f}}m_{ t} + f_{t} + \mbox{ log}_{10}(1 + i_{t})} {\sum _{t=1}^{N_{f}}N_{G}^{t}} & & {}\\ \end{array}$$
  • Detection Precision (DP): precision measurements represent how well the bounding box detections match the ground truth. For this, an overlap measure between bounding boxes is used:

    $$\displaystyle\begin{array}{rcl} O{v}^{t} =\sum \limits _{ i=1}^{N_{\mathrm{mapped}}^{t} }\frac{\left \vert G_{i}^{t} \cap D_{i}^{t}\right \vert } {\left \vert G_{i}^{t} \cup D_{i}^{t}\right \vert }&& {}\\ \end{array}$$

    where N mapped t is the number of mapped objects in frame t, i.e., the number of detections that are matched to some ground truth object. G i t is the ith ground truth object of frame t and D i t the detected object matched to G i t. The DP measure is then expressed as:

    $$\displaystyle\begin{array}{rcl} DP = \frac{\sum \limits _{t=1}^{N_{f}} \frac{O{v}^{t}} {N_{\mathrm{mapped}}^{t}}} {N_{f}} & & {}\\ \end{array}$$
  • Tracking Precision (TP): measures the spatiotemporal overlap between ground truth trajectories and detected ones, taking into account also split and merged trajectories.

    $$\displaystyle\begin{array}{rcl} TP = \frac{\sum \limits _{i=1}^{N_{\mathrm{mapped}}^{t} }\sum \limits _{t=1}^{N_{f}}\frac{\left \vert G_{i}^{t}\cap D_{ i}^{t}\right \vert } {\left \vert G_{i}^{t}\cup D_{i}^{t}\right \vert }} {\sum \limits _{t=1}^{N_{f}}N_{\mathrm{mapped}}^{t}} & & {}\\ \end{array}$$

5.1 Analysis of the Effect of the Parameters

All parameters defined in previous sections are learned from training data using one sequence of the publicly available dataset [30]. In this section we study the effect of the few parameters needed in [21], and show the method works well for a wide range of these parameters and therefore no parameter tuning is needed to obtain a good performance. The analysis is done on two publicly available datasets: a crowded town center [5] and the well-known PETS2009 dataset [12], to see the different effects of each parameters on each dataset.

Fig. 11.7
figure 7

(Color online) Tracking accuracy (black) and precision (magenta) obtained for the Town Center dataset (left column) and the PETS 2009 dataset (right column) given varying parameter values. (a) TownCenter: iterations M i . (b) PETS2009: iterations M i . (c) TownCenter: maximum speed V max. (d) PETS2009: maximum speed V max. (e) TownCenter: B j V max. (f) PETS2009: B j

5.2 Number of Iterations

The first parameter we analyze is the number of iterations M i allowed. This determines how many times the loop between computing social forces and computing trajectories is performed as explained in Algorithm 1. Looking at the results on the PETS 2009 dataset in Fig. 11.7b, we can see that after just two iterations the results remain very stable. Actually, the algorithm reports no changes in the trajectories after three iterations, and therefore stops even though the maximum number of iterations allowed is higher. The result with one and two iterations is also not very different, which means the social and grouping behavior do not significantly improve the results for this particular dataset. This is due to the fact that this dataset is very challenging from a social behavior point of view, with subjects often changing direction and groups forming and splitting frequently. More details and comments on these results can be found in Sect. 11.5.6.2. On the other hand, we observe a different effect on the TownCenter dataset, shown in Fig. 11.7a. In this case, there is a clear improvement when using social and grouping behavior (i.e. the result improves when we use more than one iteration. We also observe a pattern on how the Tracking Accuracy of the dataset evolves: there is a cycle of three iterations for which the accuracy increases and decreases in a similar pattern. This means that the algorithm is jumping between two solutions and will not converge to neither one of them. This happens when pedestrians are close together for a long period of time but are not forming a group, which means that even with social forces, it is hard to say which paths they will follow.

5.3 Maximum Speed

This is the parameter that determines the maximum speed of the pedestrians that we are observing. In this case, we can see in Fig. 11.7c, d a clear trend in which the results are very bad when we force the pedestrians to walk more slowly that they actually do, since we are artificially splitting trajectories. The results converge when the maximum speed allowed is around 3–5 m/s, which is the reported mean speed of pedestrians in a normal situation. More interestingly, we observe that the results are kept constant when using higher maximum speed values. This is a positive effect of the global optimization framework, since we can use a much higher speed limit and this will still give us good results and will allow us to track a person running through the scene, a case of panic when people start running, etc.

5.4 Cost for the Frame Difference

The last parameter, B j , appears in Eq. (11.14) and represents the penalization term that we apply when the frame difference between two detections that we want to match is larger than 1. This term is used in order to give preference to matches that are close in time. Here we can again see different effects on the two datasets. In Fig. 11.7e, we see that the results are stable until a value of 0.4. The lower the value, the higher is the penalization cost for the frame difference, which means it is more difficult to match those detections which are more than 1 frame apart. When the value of B j is higher than 0.4, there are more ambiguities in the data association process because it is easier to match detections which are many frames apart. In the TownCenter dataset, there is no occluding object in the scene, which means missing detections are sporadic within a given trajectory. In this scenario, a lower value for B j is better, since small gaps can be filled and there are less ambiguities. Nonetheless, we see different results in the PETS 2009 dataset in Fig. 11.7f, since here there is a clear occluding object in the middle of the scene (see Fig. 11.8) which occludes the pedestrians for longer periods of time. In this case, a higher value of B j allows to overcome these large gaps of missing data, and that is why the best value for this dataset is around 0.6.

Fig. 11.8
figure 8

Four frames of the PETS2009 sequence (separation of nine frames), showing several occlusions, both created by the obstacle on the scene and between pedestrians. All the occlusions can be recovered with the proposed method

5.5 Evaluation with Missing Data, Noise and Outliers

We evaluate the impact of every component of the approach in [21] with one of the sequences of the dataset [30], which contains images from a crowded public place, with several groups as well as walking and standing pedestrians. The sequence is 11,601 frames long and contains more than 300 trajectories. First of all, the group detection method is evaluated on the whole sequence with ground truth detections: 61% are correctly detected, 26% are only partially detected, 13% are not found and an extra 7% groups are detected wrongly. All experiments are performed with 6 iterations, a batch of 100 frames, \(V _{\mathrm{max}} = 7\,\mathrm{m/s}\), F max = 10, α = 0. 5 and B j  = 0. 3.

Using the ground truth (GT) pedestrian positions as the baseline for our experiments, we perform three types of tests, missing data, outliers and noise, and compare the results obtained with:

  • DIST: proposed network model with distances

  • SFM: adding the Social Force Model (Sect. 11.3.2)

  • SFM+GR: adding SFM and grouping behavior (Sect. 11.3.3)

Missing Data: This experiment shows the robustness of our approach given missed detections. This is evaluated by randomly erasing a certain percentage of detections from the GT set. The percentages evaluated are [0, 4, 8, 12, 16, 20] from the total number of detections over the whole sequence. As we can see in Fig. 11.9, both SFM and SFM+GR increase the tracking accuracy when compared to DIST.

Outliers: With an initial set of detections of GT with 2% missing data, tests are performed with [0, 10, 20, 30, 40, 50] percentage of outliers added in random positions over the ground plane. In Fig. 11.9, the results show that the SFM is especially important when the tracker is dealing with outliers. With 50% of outliers, the identity switches with SFM+GR are reduced 70% w.r.t the DIST results.

Noise: This test is used to determine the performance of our approach given noisy detections, which are very common mainly due to small errors in the 2D-3D mapping. From the GT set with 2% missing data, random noise is added to every detection. The variances of the noise tested are [0, 0. 002, 0. 004, 0. 006, 0. 008, 0. 01] of the size of the scene observed. As expected, group information is the most robust to noise; if the position of pedestrian A is not correctly estimated, other pedestrians in the group will contribute to the estimation of the true trajectory of A.

These results corroborate that having good behavioral models becomes more important as the observations deteriorate. In Fig. 11.10 we plot the tracking results of a sequence with 12% simulated missing data. Only using distance information can see identity switches as shown in Fig. 11.10a. In Fig. 11.10b we can see how missing data affects the matching results. The matches are shifted, this chain reaction is due to the global optimization. In both cases, the use of SFM allows the tracker to interpolate the necessary detections and find the correct trajectories. Finally, in Fig. 11.10c we plot the wrong result which occurs because track 3 has two consecutive missing detections. Even with SFM, track 2 is switched for 3, since the switch does not create extreme changes in velocity. In this case, the grouping information is key to obtaining good tracking results. More results are shown in Fig. 11.13, first row.

Fig. 11.9
figure 9

(Color online) Experiments are repeated 50 times and average result, maximum and minimum are plotted. Blue star = results with DIST, Green diamond = results with SFM, Red square = results with SFM+GR. From left to right: Experiment with simulated missing data, with outliers, and with random noise

Fig. 11.10
figure 10

(Color online) Top row: Tracking results with only DIST. Bottom row: Tracking results with SFM+GR. Green = correct trajectories, Blue = observation missing from the set, Red = wrong match. (a) Wrong match with DIST, corrected with SFM. (b) Missing detections cause the matches to shift due the global optimization; correct result with SFM.  (c) Missed detection for subject 3 on two consecutive frames. With SFM, subject 2 in the first frame (yellow arrow) is matched to subject 3 in the last frame (yellow arrow), creating an identity switch; correct result with grouping information

5.6 Tracking Results

In this section, we compare results of several state-of-the-art methods on two publicly available datasets: a crowded town center [5] and the well-known PETS2009 dataset [12]. We compare results obtained with:

  • Benfold et al. [5]: using the results provided by the authors for full pedestrian detections. The HOG detections are also given by the authors and used as input for all experiments.

  • Zhan et al. [40]: globally optimum tracking based on network flow linear programming.

  • Pellegrini et al. [30]: tracker based on Kalman Filter which includes social behavior.

  • Yamaguchi et al. [38]: tracker based on Kalman Filter which includes social and grouping behavior.

  • Leal-Taixé et al. [21]: globally optimum tracking based on network flow linear programming and including social and grouping behavior.

For a fair comparison, we do not use appearance information for any method. The methods [5, 30, 38] are online, while [21, 40] processes the video in batches. For these last two methods, all experiments are performed with six iterations, a batch of 100 frames, \(V _{\mathrm{max}} = 7\,\mathrm{m/s}\), F max = 10, α = 0. 5 and B j  = 0. 3.

5.6.1 Town Center Dataset

We perform tracking experiments on a video of a crowded town center [5], using one of every ten frames (simulating 2.5 fps). We show detection accuracy (DA), tracking accuracy (TA), detection precision (DP) and tracking precision (TP) measures as well as the number of identity switches (IDsw).

Table 11.1 Town Center sequence
Fig. 11.11
figure 11

Predictive approaches [30, 38] (first row) vs. Proposed method (second row)

Note, the precision reported in [5] is about 9% higher than the input detections precision; this is because the authors use the motion estimation obtained with a KLT feature tracker to improve the exact position of the detections, while we use the raw detections. Still, our algorithm reports 64% less ID switches. As shown in Table 11.1, [21] algorithm outperforms [30, 38], both of which include social behavior information, by almost 4% in accuracy and with 50% less ID switches. In Fig. 11.11 we can see an example where [30, 38] fail. The errors are created in the greedy phase of predictive approaches, where people fight for detections. The red false detection in the first frame takes the detection in the second frame that should belong to the green trajectory (which ends in the first frame). In the third frame, the red trajectory overtakes the yellow trajectory and a new blue trajectory starts where the green should have been. None of the resulting trajectories violate the SFM and GR conditions. On the other hand, a global optimization framework takes full advantage of the SFM and GR information and correctly recovers all the trajectories. More results of the proposed algorithm can be seen in Fig. 11.13, last row.

5.6.2 Results on the PETS2009 Dataset

In addition, we present results of monocular tracking on the PETS2009 sequence L1, View 1 with the detections obtained using the Mixture of Gaussians (MOG) background subtraction method. We compare the results with the previously described methods plus the monocular result of View 1 presented in [7], where the detections are obtained using the Probabilistic Occupancy Map (POM) and the tracking is done using k-shortest paths (Fig. 11.12).

Fig. 11.12
figure 12

Results of the proposed method on the PETS2009 dataset views 1. (a) Detection accuracy, DA. (b) Detection precision, DP. (c) Tracking accuracy, TA. (d) Tracking precision, TP

Fig. 11.13
figure 13

First row: Results on the BIWI dataset (Sect. 11.5.5). The scene is heavily crowded, social and grouping behavior are key to obtaining good tracking results. Second and third rows: Results on the PETS2009 dataset (Sect. 11.5.6.2). Last two rows: Results on the Town Center dataset (Sect. 11.5.6.1)

The first observation that we make is that the linear programming methods (LP and LP+SFM+GR) clearly outperform predictive approaches in accuracy. This is because this dataset is very challenging from a social behavior point of view, because the subjects often change direction and groups form and split frequently. Approaches based on a probabilistic framework [21, 40] are better suited for unexpected behavior changes (like destination changes), where other predictive approaches fail [30, 38]. We can also see that the LP+SFM+GR method has a higher accuracy than the LP method, which does not take into account social and grouping behavior. The grouping term is specially useful to avoid identity switches between member of a group (see an example in Fig. 11.13, third row, the cyan and green pedestrian who walk together). Precision is similar for all methods since the same detections have been used for all the experiments and we do not apply smoothing or correction of the bounding boxes.

6 Conclusions

In this chapter, we presented an overview of methods that integrate pedestrian interaction into a tracking framework in two ways: using a globally optimum solver or improving the dynamic model with social forces. Furthermore, we presented how to combine the strength of both approaches by finding the MAP estimate of the trajectories total posterior including social and grouping models using a minimum-cost network flow with an improved novel graph structure that outperforms existing approaches. People interaction is persistent rather than transient, hence the probabilistic formulation fully exploits the power of behavioral models as opposed to standard predictive and recursive approaches such as Kalman filtering. Experiments on three public datasets reveal the importance of using social interaction models for tracking in difficult conditions such as in crowded scenes with the presence of missed detections, false alarms and noise.