1 Introduction

2D-3D pose estimation aims to determine the pose of a known 3D object, relative to a calibrated camera, from a single 2D image. For the case of a rigid body, the pose may be described by a 6 DOF (degrees of freedom) transformation, consisting of 3 displacement parameters and 3 rotation parameters. 3D pose estimation is commonly used as a basis for 3D tracking, which has many applications, among them visual servoing of robotic arms and augmented reality applications such as medical visualization, entertainment and target tracking; see Lepetit and Fua (2005) for a complete survey (Fig. 1).

Fig. 1 Summarized performance analysis: probability of estimating the correct pose of homogeneous and heterogeneous objects

1.1 Motivation

Recently, Prisacariu and Reid (2012) presented the PWP3D algorithm for simultaneous 2D-3D pose estimation and image segmentation using a known 3D model. Following the assumption that the 3D pose of the object corresponds with the optimal segmentation of the image into foreground and background, Prisacariu and Reid (2012) define an energy function which measures the quality of fit of the global appearance models used to describe each of the regions. In this context the foreground region is the projection of the object onto the image plane, the background is the complementary region, and the segmentation is a measure of the statistical fit of the pixels within each region. The appearance models are adopted from the generative pixel-wise model presented by Bibby and Reid (2008), where the appearance models are described using posterior probability functions, rather than the commonly used likelihood probability functions. Next, analytic expressions for the energy gradients with respect to the object's pose parameters are derived, and standard gradient-based minimization is applied.

The PWP3D algorithm (Prisacariu and Reid 2012) achieves state of the art performance, while keeping a low computation cost. By parallelizing the code on a Geforce GPU, the algorithm runs in real time. However, running the algorithm in complex scenes, containing heterogeneous objects or a cluttered background, reveals a significant degradation in performance. In this context we define heterogeneity as spatial variation in statistical properties. We demonstrate this with two examples, where the 3D pose of an object is estimated in a complex scene. In both examples the object's pose parameters are initialized to the ground truth, i.e., the algorithm begins when the object is at the correct pose, in order to avoid possible dependency on the optimization algorithm.

Fig. 2 Pose estimation in complex scenes using global appearance models

1. The scene in the first example (Fig. 2a) is synthetic, comprising a non-homogeneous duck and a non-homogeneous background. Most of the duck's pixels are light colored; however, the top of its head has a small dark region. The background, too, is mostly light colored, with dark distant regions. The object's pose estimated by the algorithm is depicted by the green contour in Fig. 2a. The incorrect pose is a result of the global appearance models used by the PWP3D algorithm, which associate the dark regions of the scene with the background, causing the duck to shift away from dark regions and placing the object in light-pixel areas.

2. In the second example (Fig. 2b) there is a heterogeneous driller object from the ACCV database (Hinterstoisser et al. 2012). The driller is mostly green; however, its head is black. The background is highly cluttered, with a significant black region at the bottom of the image. Once again the black pixels are associated with the background, causing the object to drift away from the correct pose. The object's estimated pose is depicted by the green contour in the image.

These examples show that describing complex scenes, containing significant spatial variation of statistical properties, using global appearance models does not give a sufficient description of the scene, leading to an incorrect pose estimate. We suggest applying ideas from local active contours (Lankton and Tannenbaum 2008) to develop appearance models which capture the spatial variation in the foreground and background regions. We define multiple local regions centered around the 2D contour points. Each local region comprises a local foreground and a local background region, as illustrated in Fig. 3, which shows a single local region, centered at one of the contour points, with its separation into local foreground and local background. For each local region we define a local energy function measuring the segmentation quality within that region. Next, we define an energy function which fuses together the local energies, and which we optimize with respect to the pose parameters. We applied the localized algorithm to the scenes in Fig. 2, where the PWP3D failed. The results using the localized appearance models are shown in Fig. 4a, b. As depicted by the objects' contours, the localized algorithm successfully estimates the correct pose. The local regions used are circles with a radius of 30 pixels, shown along the objects' contours.

Fig. 3 Single local region

Fig. 4 Pose estimation in complex scenes using local appearance models

We demonstrate our algorithm's improvement over the PWP3D algorithm by measuring the basin of attraction of the rotation angle across all axes, the translation in X and Y, and the scale, across multiple homogeneous and heterogeneous models (Fig. 1). In each experiment the algorithms are applied after initializing the object's pose to a random initial error. Success is defined as a final rotational error of not more than 10 degrees and a translational error of not more than 10% of object size. The probability of a correct pose, per initial error bin, is defined as the ratio between the number of cases where the pose is estimated successfully and the total number of cases in that bin. The full details of the experiment are provided in Sect. 5. The figure shows a dramatic improvement when using the localized algorithm for heterogeneous objects, and little improvement for homogeneous objects, in the translation, rotation and scale experiments alike.

1.2 Relation to 2D-3D Pose Estimation Approaches

The 3D pose estimation and tracking literature is very extensive and exceeds the scope of this paper; we present a short review of the existing methods. We follow the approach of Lepetit and Fua (2005) and divide the approaches into two major types:

1. Edge based methods—these methods match the 3D object's projected edges with those in the image.

2. Methods which rely on information inside the object's projection.

The edge based methods may rely on strong gradients in the image without explicitly extracting contours, e.g., the RAPiD tracker of Harris and Stennet (1990), or on explicit contours of the object, e.g., Lowe (1987). The main drawbacks of this approach are numerous local minima (Brox et al. 2010) and sensitivity to noise or missing information (Dambreville et al. 2010).

Methods which rely on information inside the object’s projection include:

  • Methods which rely on local interest points—e.g., the SIFT (Scale Invariant Feature Transform) points of Lowe (2004). SIFT points possess very important attributes: they are invariant to scale, rotation and constant illumination changes. The majority of 3D pose estimation and 3D object recognition work performed today (e.g., Arie-Nachimson and Basri 2009; Savarese and Li 2007) relies on these features. However, these features capture mostly the texture of the scene. This could be problematic in several scenarios: when the object's texture is not known a-priori or changes in the scene (e.g., due to dirt), or when the object is texture-less.

  • Region-based approaches, which are the focus of this paper, assume that the object's pose corresponds with the optimal segmentation into foreground and background. Using a known 3D model, the foreground region is defined as the object's silhouette and the complementary region as the background. Region based methods are well proven for active contour segmentation of the foreground from the background (e.g., Chan and Vese 2001). The foreground is referred to as the interior of a contour, whereas the background is the exterior of the contour. Typically, an energy functional is defined, comprising the quality of fit of each of the regions and a penalty term limiting the contour length. The contour is found by iteratively propagating each point in the direction which optimizes the energy functional.

More recently, with the increased availability of cameras capable of depth measurement, several methods have been proposed (Tan and Ilic 2014; Brachmann et al. 2014) using depth information.

1.3 Region Based 2D-3D Pose Estimation

Several approaches have been suggested in the context of combined image segmentation and 3D pose estimation or 3D tracking.

Rosenhahn et al. (2007) extend the classical region-based segmentation energy by a 2D shape similarity term, which measures the distance between the evolved curve and the projection of the 3D shape onto the image plane. This term restricts the contour propagation to the vicinity of the object’s contour. Every iteration comprises two main stages:

1. The curve is propagated in the direction which optimizes the energy function.

2. A correspondence between the 3D model and the curve is estimated, in terms of a 6 DOF transformation. The transformation is applied to the 3D model and the curve is reinitialized according to the projected 3D curve.

In a later version, Schmaltz et al. (2007) simplify the calculations by eliminating the contour propagation stage and performing the optimization directly with respect to the pose parameters. The energy functional is computed based on classical region based terms, without the shape similarity term. Every point along the contour is assigned a force in the direction normal to the curve. The sign of the force, exterior or interior to the curve, is determined according to the region achieving a better energy value. Using 2D-3D point correspondences the 2D force is translated to a 3D force direction, which results in a rigid body transformation. Later, Schmaltz et al. (2009) suggested an integrated tracking system comprising the region based approach from Rosenhahn et al. (2007), complemented by a 2D SIFT tracker and optical flow for motion estimation between frames.

Dambreville et al. (2010) define an energy function in terms of the 3D surface model, which is assumed to be known, and its pose parameters. In contrast to Schmaltz et al. (2007), Dambreville et al. (2010) use differential geometrical tools to calculate the gradient of the energy with respect to the pose parameters, and propagate the object’s pose in this direction. This approach has a strong advantage as it allows propagating the pose parameters in the direction of the optimal segmentation. Their algorithm consists of the following steps:

1. Initialize pose parameters.

2. For each iteration:

    (a) Project the 3D surface onto the image plane.

    (b) Estimate the PDFs of the foreground and background regions, defined by the object's silhouette.

    (c) Calculate the gradient of the segmentation energy functional in terms of the pose parameters.

    (d) Propagate the pose parameters in the direction of the optimal energy.

The PWP3D algorithm of Prisacariu and Reid (2012) follows an approach similar to Dambreville et al. (2010), while simplifying the calculations using level set functions. In contrast to Dambreville et al. (2010), who formulate the energy function using two separate surface integrals over the foreground and the background, Prisacariu and Reid (2012) formulate the energy function as a sum over the image domain, using a Heaviside function evaluated over the embedding function to delineate each region. This step simplifies the energy gradient with respect to the pose parameters—the integral over the 3D occluding curve is replaced with a sum over the 2D contour. Additionally, Prisacariu and Reid (2012) follow Bibby and Reid (2008) and define appearance models using posterior probability functions, instead of the likelihood functions used by Dambreville et al. (2010). Bibby and Reid (2008) show that appearance models which rely on posterior probability functions are advantageous over likelihood functions. The PWP3D algorithm achieves results similar to those of the integrated approach suggested by Schmaltz et al. (2009), while running in real time.

However, as we showed in Fig. 2, global appearance models may be insufficient for capturing spatial variations in the foreground and background. Hence, a more sophisticated model which takes into account local variations is required. Thus, we suggest applying ideas from local active contours (Lankton and Tannenbaum 2008) in order to derive appearance models which capture the spatial variation in the foreground and background regions.

1.4 Localized 2D Image Segmentation

The idea of localizing segmentation calculations has been proposed in the past in different contexts. Rosenhahn et al. (2007) model the regions using varying local Gaussian probabilities. For each pixel they define a small window which is used to estimate the mean and standard deviation.

Schmaltz et al. (2009) concentrate on a free form surface consisting of rigid parts interconnected by predefined joints. Each part has its own appearance model, and the background is separated into multiple sub-regions, modeled using mixture models. They assume the background is static or slowly varying. They propose two algorithms for segmenting the background: K-means, which requires knowing the model order, or a level set algorithm which optimizes the number of regions. Assigning localized region models to different parts could indeed have many advantages, when such a division is known a-priori. However, selecting the correct model order for the background segmentation could be a difficult task.

Horbert et al. (2011) combine the Implicit Shape Model of Leibe et al. (2004) with a localized version of the posterior pixel-wise appearance models of Bibby and Reid (2008), focusing on segmentation and tracking of pedestrians. This work attempts to capture the spatial variability using two appearance models for the foreground, one for the upper body and one for the lower body, and two appearance models for the background. While separating the object into two regions seems like an adequate approach, modeling the background in terms of two appearance models may be insufficient in cluttered scenes.

Lankton and Tannenbaum (2008) present a framework for localizing active contours, i.e., propagating the active contour based on local region appearance models. For each point along the curve a local region is defined, and is split into a part interior and a part exterior to the curve. A local energy measuring the segmentation match between the two regions is defined. Each point is propagated in the direction that optimizes its local energy, independently of other local decisions. This approach is very suitable to our segmentation problem as it addresses the spatial variation both in the foreground and in the background, without assuming any prior knowledge of either region. In contrast to Rosenhahn et al. (2007), where localized Gaussian distributions are defined, no assumptions are made regarding the probability functions. A key issue when using local region statistics is the size of the local regions. While Gaussian models are sufficient for very small regions, as the region sizes increase Gaussian models become insufficient and a more complex model must be considered.

1.5 Contribution

We propose a framework for simultaneous 3D pose estimation and image segmentation using local region statistics, instead of the global region statistics used in the standard formulation. Local region statistics are capable of capturing spatial variation in image statistics. Thus we improve 3D pose estimation in scenes containing heterogeneous objects or cluttered backgrounds.

We present the framework on the basis of the PWP3D algorithm (Prisacariu and Reid 2012), however it can be applied to other global region based methods, e.g., Dambreville et al. (2010).

We define a local energy function, which measures the segmentation quality within a local region. We fuse together the local energy functions into a single energy function and optimize it with respect to the pose parameters.

Finally, we present extensive experiments performed using the ACCV database (Hinterstoisser et al. 2012), comparing our algorithm's performance with the PWP3D algorithm (Prisacariu and Reid 2012). Pose estimation algorithms, such as PWP3D, may be applied as a first stage in more advanced algorithms (e.g., Dame et al. 2013). Enriching the basic algorithm to handle non-homogeneous objects directly enriches each such advanced algorithm.

Structure of the paper—in Sect. 2 we present our approach to local region based pose estimation. Next, in Sect. 3 we discuss local region size selection. In Sect. 4 we discuss the implementation details. In Sect. 5 we present our results compared with the PWP3D algorithm. In Sect. 6 we make concluding remarks and give directions for future research.

2 Proposed Approach

In this section we present our proposed framework for extending the PWP3D algorithm (Prisacariu and Reid 2012) using local region statistics, instead of global region statistics. This section is divided into three parts: in Sect. 2.1 we present the formulation of the problem; in Sect. 2.2 we review the main steps of the PWP3D algorithm; in Sect. 2.3 we present our localized extension, highlighting the differences between our algorithm and the PWP3D.

We assume we are given a 3D surface model of an object appearing in an input image from a known calibrated camera. Our objective is to find the transformation parameters that map the object's model onto the object in the image.

Our algorithm relies on an initialization of the pose parameters, and iteratively propagates the pose parameters using the gradients of the energy with respect to them. The outline of our algorithm may be described by the following steps:

1. Initialize pose parameters: \(\lambda = \lambda _0 \).

2. For each iteration:

    (a) Apply the 3D transformation to the object.

    (b) Project the 3D model onto the image plane.

    (c) For each local region:

        (i) Estimate the local region statistics.

        (ii) Calculate the local energy gradient with respect to the pose parameters, \( \nabla E_n \).

    (d) Fuse the local region gradients, \( \nabla E = f \left( \nabla E_n \right) \).

    (e) Find the optimal step size, s.

    (f) Update the pose parameters: \( \lambda = \lambda - s \nabla E \).

Here f is a function which fuses the local gradients, s is the step size and \( \lambda \) is the pose parameter vector; a code sketch of this loop is given below. These steps are explained in depth throughout this section. Steps 2(a)-2(b), applying the 3D transformation and projecting the object onto the image, are illustrated using the driller model in Fig. 5. Step 2(c), estimating the local region statistics, is shown for two different local regions in Figs. 6 and 8. Their corresponding local foreground probability functions are shown in Figs. 7b and 9b, and their corresponding local background probability functions in Figs. 7a and 9a. Each local region is affected by different elements—the background of the first local region is strongly affected by the blue bench vise, whereas the background of the second local region is strongly affected by the red ape model. These examples demonstrate the problem of describing the background and foreground regions using single appearance models when there is strong spatial variation. This variation in statistical properties within each region makes it unreasonable to use a single appearance model to describe the entire region.
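To make the outline concrete, the following is a minimal Python sketch of the iteration loop, under the assumption that the operations derived in the remainder of this section are available as callables; project_model, local_gradient and find_step are illustrative names, not part of any existing implementation:

```python
import numpy as np

def estimate_pose(image, lam0, project_model, local_gradient, find_step, n_iters=100):
    """Sketch of the iterative pose estimation loop (steps 1-2(f) above).

    lam0 is the initial pose vector [lambda_1, ..., lambda_6];
    project_model, local_gradient and find_step stand in for the
    operations derived in Sects. 2.1-2.3.
    """
    lam = np.asarray(lam0, dtype=float)            # step 1: initialization
    for _ in range(n_iters):
        # Steps 2(a)-(b): transform the model and project it, obtaining the
        # contour points and the signed distance transform Phi.
        contour, phi = project_model(lam)
        # Step 2(c): local statistics and local energy gradients.
        grads = [local_gradient(image, phi, xc, lam) for xc in contour]
        # Step 2(d): fuse the local gradients; with Eq. (14) this is the mean.
        grad = np.mean(grads, axis=0)
        # Steps 2(e)-(f): optimal step size, then a gradient step.
        s = find_step(lam, grad)
        lam = lam - s * grad
    return lam
```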

Fig. 5 Driller object projection onto the image plane

Fig. 6 Example of a local region extraction, divided into the local foreground (red) and local background (blue)

Fig. 7 Probability density functions of a local region

Fig. 8 Example of a local region extraction, divided into the local foreground (red) and local background (blue)

Fig. 9 Probability density functions of a local region

2.1 Model

We begin by defining the rigid body transformation, which maps points from the object coordinate frame to the camera coordinate frame. A 3D point in the camera coordinate frame is denoted by \(\varvec{X}=[X,Y,Z]^{T} = \varvec{R} \varvec{X_{0}} + \varvec{T} \in \mathbb {R}^{3}\), where:

  • \(\varvec{X_{0}} = [X_{0},Y_{0},Z_{0}]^{T}\) is the corresponding point in the object coordinate frame.

  • \(\varvec{T} = [\lambda _{1}, \lambda _{2}, \lambda _{3}]^{T} \) denotes the translation vector in the xyz directions respectively.

  • \( \varvec{R} \) is a rotation matrix represented in canonical exponential coordinates. It can be shown (Ma et al. 2003) that for any rotation matrix \( R \in SO(3) \) (a Lie group) there exist \( w \in \mathbb {R}^{3}\), \( \Vert w \Vert =1 \), and \( t \in \mathbb {R} \) such that:

    $$\begin{aligned} \varvec{R} = \exp ( \varvec{\hat{w}} t) \end{aligned}$$
    (1)

    where \( \varvec{\hat{w}} \) is the skew-symmetric matrix of the unit vector w. The unit vector w is the rotation axis and t is the rotation angle in radians. We define the rotation vector as \( [\lambda _{4}, \lambda _{5}, \lambda _{6}] = wt \), and a skew-symmetric matrix:

    $$\begin{aligned} \varvec{\Lambda }= \left[ \begin{array}{c@{\quad }c@{\quad }c} 0 &{} -\lambda _{6} &{} \lambda _{5}\\ \lambda _{6} &{} 0 &{} -\lambda _{4}\\ -\lambda _{5} &{} \lambda _{4} &{} 0 \end{array}\right] \end{aligned}$$
    (2)

    The rotation matrix is given by:

    $$\begin{aligned} \varvec{R} = \exp (\varvec{\Lambda }) \end{aligned}$$
    (3)

The choice of exponential coordinates is merely due to their simplicity. The intrinsic camera parameters are:

  • \(\left( f_{x},f_{y}\right) \), the focal lengths along the x and y axes.

  • \(\left( u_{0},v_{0}\right) \), the principal point of the camera.
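
As an illustration of this parametrization, the following is a minimal sketch of the mapping from the pose vector \( \lambda \) to a camera-frame point and its pixel coordinates, using the matrix exponential for Eq. (3) and the pinhole model introduced in Sect. 2.2; the function names are ours:

```python
import numpy as np
from scipy.linalg import expm

def pose_to_rt(lam):
    """Rotation R = exp(Lambda) (Eqs. 1-3) and translation T from lambda."""
    T = np.asarray(lam[:3], dtype=float)
    l4, l5, l6 = lam[3], lam[4], lam[5]
    Lam = np.array([[0.0, -l6,  l5],
                    [ l6, 0.0, -l4],
                    [-l5,  l4, 0.0]])              # skew-symmetric matrix, Eq. (2)
    return expm(Lam), T                            # matrix exponential, Eq. (3)

def project(X0, lam, fx, fy, u0, v0):
    """Pixel coordinates (u, v) of an object-frame point X0."""
    R, T = pose_to_rt(lam)
    X, Y, Z = R @ np.asarray(X0, dtype=float) + T  # camera-frame point X
    return X / Z * fx + u0, Y / Z * fy + v0        # pinhole projection
```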

2.2 Global Region Based Pose Estimation

In this subsection we review the PWP3D global region based pose estimation framework developed by Prisacariu and Reid (2012); we reference the relevant equation numbers from the original work. In the following subsection we present our extension of it to local region based 2D-3D pose estimation. Table 1 defines the notation used in the context of the global region framework, and Fig. 10 illustrates the problem in terms of global region properties.

Table 1 Global region notation

Fig. 10 Global region model

Prisacariu and Reid (2012) showed that, assuming pixel-wise independence, the posterior probability of the shape of the contour given the image data is (Eq. 4):

$$\begin{aligned} P\left( \varPhi \mid I\right)&= \underset{\varvec{x}\in \varOmega }{\prod } \left[ P_{f}\left( \varvec{y}\right) H_{e}\left( \varPhi (\varvec{x})\right) \right. \nonumber \\&\left. \quad +P_{b}\left( \varvec{y}\right) \left( 1-H_{e}\left( \varPhi (\varvec{x})\right) \right) \right] \end{aligned}$$
(4)

Equation (4) describes the probability of the shape kernel \( \varPhi \), defined by the constant and known 3D model and the unknown pose parameters, given the image data. Hence, it can be thought of as the posterior probability of the pose parameters given the image data. Where (Eqs. 7, 8):

$$\begin{aligned}&P_{f}\left( \varvec{y}\right) =\frac{P\left( \varvec{y}\mid M_{f}\right) \eta _{f}}{P\left( \varvec{y}\mid M_{f}\right) \eta _{f}+P\left( \varvec{y}\mid M_{b}\right) \eta _{b}}\\&P_{b}\left( \varvec{y}\right) =\frac{P\left( \varvec{y}\mid M_{b}\right) \eta _{b}}{P\left( \varvec{y}\mid M_{f}\right) \eta _{f}+P\left( \varvec{y}\mid M_{b}\right) \eta _{b}}\\&\eta _{f}=\underset{\varvec{x} \in \varOmega }{\sum }H_{e}\left( \varPhi \left( \varvec{x}\right) \right) ,\,\eta _{b}=\underset{\varvec{x}\in \varOmega }{\sum }\left[ 1-H_{e}\left( \varPhi \left( \varvec{x}\right) \right) \right] \end{aligned}$$

\( \varPhi \) is the distance transform defined as:

$$\begin{aligned} \varPhi \left( \varvec{x}\right) ={\left\{ \begin{array}{ll} -d &{} \hbox { for }\quad {\varvec{x}}\quad \hbox { inside the silhouette} \\ d &{} \hbox { for }\quad {\varvec{x}}\quad \hbox {outside the silhouette} \end{array}\right. } \end{aligned}$$

where d is the shortest distance between the pixel location \( \varvec{x} \) and the object's 2D contour. An example of a distance transform applied to the driller model is presented in Fig. 11, along with the 2D contour. The color bar indicates the distance transform value, i.e., the signed minimal distance from every pixel to the object's contour.
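
One convenient way to compute such a signed distance transform from a binary silhouette mask is sketched below, using SciPy's Euclidean distance transform; this is our own construction, not the original implementation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(silhouette):
    """Signed distance transform Phi of a binary silhouette mask:
    negative inside the silhouette, positive outside (pixel units)."""
    mask = np.asarray(silhouette, dtype=bool)
    return distance_transform_edt(~mask) - distance_transform_edt(mask)
```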

Fig. 11 Distance transform applied to driller

The energy function is defined as the negative log posterior probability (Eqs. 5, 6):

$$\begin{aligned} E = -\hbox {log}\,P\left( \varPhi \mid I\right) = -\sum _{\varvec{x}\in \varOmega }\hbox {log}\left[ P_{f}H_{e}\left( \varPhi \right) +P_{b}\left( 1-H_{e}\left( \varPhi \right) \right) \right] \end{aligned}$$
(5)

The conditional probability functions may be estimated using a smoothed histogram of the photometric variable chosen to perform the segmentation. We follow the choice of Prisacariu and Reid (2012) and use the photometric intensity of the RGB channels. A more sophisticated choice could be texture features, which may be important for textured objects (e.g., a zebra), as done by Rosenhahn et al. (2007). Next, the derivatives of the energy function are calculated with respect to the pose parameters (Eqs. 11, 12):

$$\begin{aligned} \frac{\partial E}{\partial \lambda _{i}}=-\sum _{\varvec{x}\in \varOmega }\frac{P_{f}-P_{b}}{P_{f}H_{e}\left( \varPhi \right) +P_{b}\left( 1-H_{e}\left( \varPhi \right) \right) } \frac{\partial H_{e}\left( \varPhi \right) }{\partial \lambda _{i}} \end{aligned}$$
(6)

This equation comprises two components: the first (the left-hand fraction) relies on the estimated statistical properties; the second is a function of the object's geometry, independent of the statistical properties. Applying the chain rule we get:

$$\begin{aligned} \frac{\partial H_{e}\left( \varPhi \right) }{\partial \lambda _{i}}&= \frac{\partial H_{e}\left( \varPhi \right) }{\partial \varPhi }\left[ \frac{\partial \varPhi }{\partial u}\frac{\partial u}{\partial \lambda _{i}}+\frac{\partial \varPhi }{\partial v}\frac{\partial v}{\partial \lambda _{i}}\right]&\\&= \delta _{e}\left( \varPhi \right) \left[ \frac{\partial \varPhi }{\partial u}\frac{\partial u}{\partial \lambda _{i}}+\frac{\partial \varPhi }{\partial v}\frac{\partial v}{\partial \lambda _{i}}\right] \nonumber \end{aligned}$$
(7)

Substituting the camera model (Eqs. 13, 14):

$$\begin{aligned}&\left[ \begin{array}{c} u\\ v \end{array}\right] =\left[ \begin{array}{c} \frac{X}{Z}f_{x}+u_{0}\\ \frac{Y}{Z}f_{y}+v_{0} \end{array}\right] \nonumber \\&\frac{\partial u}{\partial \lambda _{i}}=f_{x}\frac{\partial }{\partial \lambda _{i}}\frac{X}{Z}=f_{x}\frac{1}{Z^{2}}\left( Z\frac{\partial X}{\partial \lambda _{i}}-X\frac{\partial Z}{\partial \lambda _{i}}\right) \end{aligned}$$
(8)
$$\begin{aligned}&\frac{\partial v}{\partial \lambda _{i}}=f_{y}\frac{\partial }{\partial \lambda _{i}}\frac{Y}{Z}=f_{y}\frac{1}{Z^{2}}\left( Z\frac{\partial Y}{\partial \lambda _{i}}-Y\frac{\partial Z}{\partial \lambda _{i}}\right) \end{aligned}$$
(9)

The differentials with respect to the translation parameters (\( \lambda _1, \lambda _2, \lambda _3 \) ) are given by:

$$\begin{aligned} \frac{\partial \varvec{X}_{j}}{\partial \lambda _{i}}=\delta _{i,j}\qquad i,j=1,2,3 \end{aligned}$$

The differentials with respect to the rotation parameters (\( \lambda _4, \lambda _5, \lambda _6 \)) are given by:

$$\begin{aligned} \frac{\partial }{\partial \lambda _{j}}\varvec{X} = \frac{\partial }{\partial \lambda _{j}}\varvec{R}\left[ \begin{array}{c} X_{0}\\ Y_{0}\\ Z_{0} \end{array}\right] \\ \end{aligned}$$

Where:

$$\begin{aligned} \frac{\partial }{\partial \lambda _{j}} \varvec{R} = \exp (\varvec{\Lambda }) \left[ \begin{array}{c@{\quad }c@{\quad }c} 0 &{} -\delta _{j,6} &{} \delta _{j,5}\\ \delta _{j,6} &{} 0 &{} -\delta _{j,4}\\ -\delta _{j,5} &{} \delta _{j,4} &{} 0 \end{array}\right] , j=4,5,6 \end{aligned}$$
(10)

Finally arriving at:

$$\begin{aligned} \frac{\partial }{\partial \lambda _{j}}\varvec{X} = \varvec{R}\left[ \begin{array}{c@{\quad }c@{\quad }c} 0 &{} -\delta _{j,6} &{} \delta _{j,5}\\ \delta _{j,6} &{} 0 &{} -\delta _{j,4}\\ -\delta _{j,5} &{} \delta _{j,4} &{} 0 \end{array}\right] \left[ \begin{array}{c} X_{0}\\ Y_{0}\\ Z_{0} \end{array}\right] , j=4,5,6\nonumber \\ \end{aligned}$$
(11)
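
The following sketch collects the geometric differentials of Eqs. (8), (9) and (11) for a single point; the (3, 6) Jacobian layout and the function names are our own illustration:

```python
import numpy as np

def point_derivatives(X0, R):
    """(3, 6) matrix of derivatives dX/dlambda_i of a camera-frame point.

    Columns 1-3 are the translation derivatives (delta_ij, the identity);
    columns 4-6 are the rotation derivatives of Eq. (11).
    """
    X0 = np.asarray(X0, dtype=float)
    J = np.zeros((3, 6))
    J[:, :3] = np.eye(3)                           # d X / d lambda_{1,2,3}
    for j in range(3):                             # lambda_{4,5,6}
        G = np.zeros((3, 3))                       # generator dLambda/dlambda_{4+j}
        G[(j + 1) % 3, (j + 2) % 3] = -1.0
        G[(j + 2) % 3, (j + 1) % 3] = 1.0
        J[:, 3 + j] = R @ G @ X0
    return J

def pixel_derivatives(X, J, fx, fy):
    """du/dlambda and dv/dlambda via the quotient rule, Eqs. (8)-(9)."""
    Xc, Yc, Z = X
    du = fx / Z**2 * (Z * J[0] - Xc * J[2])
    dv = fy / Z**2 * (Z * J[1] - Yc * J[2])
    return du, dv
```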

2.3 Local Region Based Pose Estimation

In this subsection we present our localized region based 3D pose estimation model. Table 2 defines the notation used in the context of the local region framework; the notation defined in the scope of the global region framework remains unchanged.

Table 2 Local region notation

Fig. 12 Local region model

We illustrate the local region parameters in Fig. 12. Following the approach used by Lankton and Tannenbaum (2008) we define a characteristic function, \(\varvec{B}_{n}(\varvec{x}_{i})\), which masks local regions. The subscript n denotes the local region index:

$$\begin{aligned} \varvec{B}_{n}(\varvec{x}_{i})={\left\{ \begin{array}{ll} 1 &{} \varvec{x}_{i}\in \varOmega _{n}\\ 0 &{} \varvec{x}_{i}\notin \varOmega _{n} \end{array}\right. } \end{aligned}$$

Specifically, we select \(\varvec{B}_{n}(\varvec{x}_{i},\varvec{x}_{c})\) as a circular binary mask centered at \(\varvec{x}_{c}\), with a radius size d:

$$\begin{aligned} \varvec{B}_{n}(\varvec{x}_{i}, \varvec{x}_{c})={\left\{ \begin{array}{ll} 1 &{} \vert \varvec{x}_{i} - \varvec{x}_{c}\vert <d\\ 0 &{} \hbox {else} \end{array}\right. } \end{aligned}$$
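
A minimal sketch of this characteristic function as a binary image mask, assuming pixel coordinates given as (row, col), follows:

```python
import numpy as np

def circular_mask(shape, xc, d):
    """Characteristic function B_n: a binary disc of radius d (pixels)
    centered at the contour point xc = (row, col)."""
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    return (rows - xc[0])**2 + (cols - xc[1])**2 < d**2
```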

Next, we must determine how to divide the image domain into local regions. We recall from Sect. 2.2 that although the probability functions rely on statistical data collected over the entire image domain, the actual gradient calculation is performed using probability functions evaluated along the object's edge. This is depicted by the delta function multiplying the probability functions in Eq. (7). Hence, it is sufficient to define the local regions only along the contour. Specifically, we define a local region around every point along the object's contour. We extend the generative posterior pixel-wise model of Bibby and Reid (2008), used by Prisacariu and Reid (2012) to define the global appearance models, to a localized pixel-wise posterior model. The full derivation is presented in the Appendix. The expression we arrive at for the local energy of the \( n'{\text {th}} \) region is given by:

$$\begin{aligned} E_n= & {} - \sum _{\varvec{x}_{i}\in \varOmega _{n}} \hbox {log} \left[ P_{f_{n}} H_{\epsilon } \left( \varPhi (\varvec{x}_{i}) \right) \right. \nonumber \\&\left. +\, P_{b_{n}} \left( 1-H_{\epsilon }\left( \varPhi (\varvec{x}_{i}) \right) \right) \right] \end{aligned}$$
(12)

Equivalently, this may be written using the characteristic function, \( \varvec{B}_{n}(\varvec{x}_{i}) \), as:

$$\begin{aligned} E_n&= - \sum _{\varvec{x}_{i}\in \varOmega } \hbox {log} \left[ P_{f_{n}} H_{\epsilon } \left( \varPhi (\varvec{x}_{i}) \right) \right. \nonumber \\&\left. \quad + P_{b_{n}} \left( 1-H_{\epsilon }\left( \varPhi (\varvec{x}_{i}) \right) \right) \right] \varvec{B}_{n}(\varvec{x}_{i}) \end{aligned}$$
(13)

where \(P_{f_{n}},P_{b_{n}}\) are the localized posterior probabilities of the \( n \hbox {'th}\) local region, which replace the global posterior probabilities \(P_{f},P_{b}\). By replacing \(P_{f},P_{b}\) with \(P_{f_{n}},P_{b_{n}}\) we rely on the local statistical properties of each region, rather than the global statistics of the entire image. \(P_{f_{n}},P_{b_{n}}\) are given by:

$$\begin{aligned} P_{f_{n}}= & {} \frac{P(\varvec{y}\mid M_{f_{n}})}{\eta _{f_{n}}P(\varvec{y}\mid M_{f_{n}})+\eta _{b_{n}}P(\varvec{y}\mid M_{b_{n}})}\\ P_{b_{n}}= & {} \frac{P(\varvec{y}\mid M_{b_{n}})}{\eta _{f_{n}}P(\varvec{y}\mid M_{f_{n}})+\eta _{b_{n}}P(\varvec{y}\mid M_{b_{n}})}\\ \eta _{f_{n}}= & {} \sum _{\varvec{x}_{i}\in \varOmega }B_{n}(\varvec{x}_{i})H_{\epsilon }(\varPhi (\varvec{x}_{i})) \\= & {} \sum _{\varvec{x}_{i}\in \varOmega _{n}} H_{\epsilon }(\varPhi (\varvec{x}_{i})) \\ \eta _{b_{n}}= & {} \sum _{\varvec{x}_{i}\in \varOmega }B_{n}(\varvec{x}_{i})(1-H_{\epsilon }(\varPhi (\varvec{x}_{i}))) \\= & {} \sum _{\varvec{x}_{i}\in \varOmega _{n}}(1-H_{\epsilon }(\varPhi (\varvec{x}_{i}))) \\ \eta _{n}= & {} \eta _{f_{n}}+\eta _{b_{n}} \end{aligned}$$
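
As an illustration, the local posteriors and area terms of a single region might be computed as follows, assuming the per-pixel likelihoods \( P(\varvec{y}\mid M_{f_{n}}) \) and \( P(\varvec{y}\mid M_{b_{n}}) \) have already been evaluated from histograms estimated within \( \varOmega _{n} \); the names and the small stabilizing constant are ours:

```python
import numpy as np

def local_posteriors(pdf_f, pdf_b, mask, h_phi, eps=1e-12):
    """Local posteriors P_fn, P_bn and areas for one region (formulas above).

    pdf_f, pdf_b: per-pixel likelihoods P(y|M_fn), P(y|M_bn);
    mask: the region's characteristic function B_n;
    h_phi: the smoothed Heaviside H_eps(Phi(x)).
    """
    eta_f = np.sum(mask * h_phi)                   # local foreground area
    eta_b = np.sum(mask * (1.0 - h_phi))           # local background area
    denom = eta_f * pdf_f + eta_b * pdf_b + eps    # eps guards empty regions
    return pdf_f / denom, pdf_b / denom, eta_f, eta_b
```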

The energy function which fuses the N local regions is defined as:

$$\begin{aligned} E= & {} \frac{1}{N}\sum _{n=1}^{N} E_{n} \end{aligned}$$
(14)
$$\begin{aligned} E= & {} - \frac{1}{N}\sum _{n=1}^{N}\sum _{\varvec{x}_{i}\in \varOmega } \hbox {log} \left[ P_{f_{n}} H_{\epsilon } \left( \varPhi (\varvec{x}_{i}) \right) \right. \nonumber \\&\quad \left. +P_{b_{n}} \left( 1-H_{\epsilon }\left( \varPhi (\varvec{x}_{i}) \right) \right) \right] \varvec{B}_{n}(\varvec{x}_{i}) \end{aligned}$$
(15)
Fig. 13 Energy function evaluation over heterogeneous object

This equation shows the crucial difference between our method and the PWP3D. The PWP3D energy, shown in Eq. (5), relies on a single appearance model for the entire image domain, while our algorithm relies on N different local regions. In the PWP3D algorithm all the points considered in the summation rely on the same appearance model, while in our algorithm every point considered has its own appearance model, based on its local surroundings. The gradients of the energy function with respect to the pose parameters are given by:

$$\begin{aligned} \frac{\partial E}{\partial \lambda _{i}}= & {} -\frac{1}{N} \sum _{n=1}^{N} \sum _{\varvec{x}_{i}\in \varOmega _{n}}\frac{P_{f_{n}}-P_{b_{n}}}{P_{f_{n}}H_{e}\left( \varPhi \right) +P_{b_{n}} \left( 1-H_{e}\left( \varPhi \right) \right) } \nonumber \\&\quad \times \frac{\partial H_{e}\left( \varPhi \right) }{\partial \lambda _{i}}B_{n}(\varvec{x}_{i}) \end{aligned}$$
(16)

The term \( \frac{\partial H_{e}\left( \varPhi \right) }{\partial \lambda _{i}}\) depends strictly on the geometry of the object; it can be interpreted as the geometric differential of the object with respect to the pose parameters. The term \( \frac{P_{f_{n}}-P_{b_{n}}}{P_{f_{n}}H_{e}\left( \varPhi \right) +P_{b_{n}} \left( 1-H_{e}\left( \varPhi \right) \right) }\) can be interpreted as the weight applied to the geometric differentials, based on the statistical fit. In the PWP3D, Eq. (6), this weight is determined by the fit of the global appearance model at a given point, whereas our extension, Eq. (16), weighs the geometric differentials by the fit of the local appearance model at that point.

We illustrate the impact of the localized energy function using the glue object from the ACCV 2012 database (Hinterstoisser et al. 2012). The image in Fig. 13a shows the glue object, which we consider heterogeneous—its body is white, however its tip is black and it has black texture on the body. In Fig. 13b we evaluate the term \(\hbox {log}\left[ P_{f}H_{e}\left( \varPhi \right) +P_{b} \left( 1-H_{e}\left( \varPhi \right) \right) \right] \) from Eq. (5) over the entire image. This term is the contribution of each pixel in the foreground and background to the energy function. The figure illustrates the problem of using global appearance models—the black area adds a penalty to the energy function, causing the object to avoid such areas. Next, in Fig. 13c we evaluate the term \(\hbox {log} \left[ P_{f_{n}} H_{\epsilon } \left( \varPhi (\varvec{x}_{i}) \right) + P_{b_{n}} \left( 1-H_{\epsilon }\left( \varPhi (\varvec{x}_{i}) \right) \right) \right] \) from Eq. (12) over a single local region. This figure shows that the penalty due to including the glue's black tip is considerably lower than with the global appearance models.
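
Combining Eqs. (7) and (16), the fused gradient might be assembled as in the following sketch, assuming the geometric terms \( \partial H_{e}(\varPhi )/\partial \lambda _{i} \) are available as a (6, H, W) array; the array layout is our own illustration:

```python
import numpy as np

def fused_gradient(regions, h_phi, dh_dlam):
    """Energy gradient of Eq. (16), fusing the N local regions.

    regions: list of (mask, p_f, p_b) full-image arrays per region;
    h_phi: H_e(Phi); dh_dlam: (6, H, W) array of dH_e(Phi)/dlambda_i.
    """
    grad = np.zeros(dh_dlam.shape[0])
    for mask, p_f, p_b in regions:
        # Statistical weight applied to the geometric differentials.
        w = (p_f - p_b) / (p_f * h_phi + p_b * (1.0 - h_phi))
        grad -= np.sum(mask * w * dh_dlam, axis=(1, 2))
    return grad / len(regions)                     # the 1/N fusion of Eq. (14)
```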

3 Local Region Size Selection

A key issue in the framework of local region segmentation is the selection of the region sizes over which the local statistics are estimated. We use circular local regions and determine their size by setting their radius. By adjusting the region sizes we determine to what extent we use global or local region statistics:

1. As the region sizes increase, the local regions become more correlated, with less variation between regions, thus leaning towards global region statistics. When the region sizes exceed the size of the image, all regions become identical, recovering the original global model.

2. As the region sizes decrease, the local regions become less correlated and the statistics vary more rapidly between regions, allowing them to better capture the spatial variation.

The region size selection offers a basic trade-off between robustness and the capability to capture spatial variation. Using large regions, the region statistics are more robust to the initialization of the object's pose: changes in the object's pose have a lower impact on the statistical models. However, as we demonstrated in Fig. 2, large regions are insufficient for capturing the variation in the statistical properties of the foreground and background. Using small regions, the ability to capture variations in the region statistics increases, as shown in Fig. 4a, b. The downfall of a small radius is the loss of robustness: small changes in the object's pose may strongly affect the region statistics, to the extent of over-fitting. In this case the local foreground and local background statistics will not capture the true statistical properties of the foreground and background, but rather the statistics of the current region. The algorithm will then optimize the pose of the object using the statistics of the local regions, rather than the true foreground and background. Consider for example the duck object in Fig. 14. Due to a poor scale selection the local region, shown as a circle, contains only foreground statistics. In this case the algorithm is likely to keep the object in the same location, as the energy will indicate a good segmentation.

Fig. 14 Example of over-fitting due to small radius selected (radius = 10)

The radius selection is directly related to the well known trade-off of model order selection. The model order is inversely proportional to the radius size—a small radius will result in many statistically independent regions, hence a high order model, whereas a large radius will result in a higher correlation between the regions, restricting the model order. The trade-off in region size is between descriptiveness, which increases with the model order, and robustness, which increases as the model order decreases. Lankton and Tannenbaum (2008) studied the problem of radius size selection for localized active contour segmentation. They suggest selecting the parameter based on the scale of the object: a small object or a cluttered background requires a small radius to correctly capture the variation between regions, whereas for a large object and a slowly varying background a large radius is preferred. Their results are applicable to our problem as well, with a few subtle differences. In the active contours problem each point is free to move independently of the other points, whereas in our problem the 3D model imposes a geometric constraint on the possible pose parameter propagation. This constraint is depicted in the gradient calculation, Eq. (16), where a summation over the local regions is performed; hence the influence of a single local region is limited.

We explored the impact of the radius size by examining the probability of a correct pose as a function of radius size, for various initial rotation errors. We defined a correct pose as a final error of less than 10 degrees. We performed this experiment using two objects: a heterogeneous driller object and a homogeneous ape object. The results of the experiments are presented in Fig. 15. A strong improvement is shown for the non-homogeneous driller, where the global model is insufficient, and little improvement for the homogeneous ape model, where the global model is expected to be sufficient.

Fig. 15 Probability for a correct pose estimation as a function of radius size, for various rotation error sizes

The results demonstrate several issues discussed earlier:

1. Robustness—the performance of the algorithm is relatively stable for a wide range of radius sizes, from 40 to 120 pixels.

2. Over-fitting—selecting a very small radius (10 pixels) results in over-fitting: the performance is very good for small initial errors and degrades as the initial error increases.

3. Global model insufficiency—for large radius sizes the performance is severely impacted, due to the model insufficiency described earlier.

4 Implementation Details

4.1 Local Region Dilution

Defining a set of local regions on which the statistical calculations are performed has serious run time implications. In order to reduce run time we define a dilution factor: the distance between local regions at which the local statistics are actually computed. In the intermediate local regions the statistics are estimated by linear interpolation between the nearest computed regions. In our work we selected a dilution factor \( \hbox {d} = 0.05 \hbox {R} \).
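
A minimal sketch of this dilution, assuming the contour points are ordered along the contour (the interpolation of the intermediate regions is omitted):

```python
import numpy as np

def diluted_indices(n_points, radius, factor=0.05):
    """Indices of contour points where local statistics are computed,
    spaced by the dilution factor d = factor * R (0.05R in our work)."""
    step = max(1, int(round(factor * radius)))
    return np.arange(0, n_points, step)
```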

4.2 Run Time

An important parameter in determining the feasibility of applying an algorithm in a realistic system is its run time. The global PWP3D presented run times on the order of several milliseconds, by parallelizing the computations using the CUDA framework on a Geforce video card. Applying the localized algorithm requires computing the statistical properties of multiple local regions, in contrast to the global algorithm, where the computation is performed on a single region. The run time required for histogram calculation may increase by a factor of O(N), N being the number of regions, due to the independent calculation required for each region. In practice the increase is expected to be considerably lower, as the region over which each local histogram is computed is considerably smaller. The total run time is a function of many parameters—the 3D model complexity, code efficiency, hardware, etc. In order to compare the run time of the two algorithms we measured the average run time per iteration for various local region sizes. We performed this experiment with two objects—the driller model and the ape model. We implemented both algorithms using the MATLAB Parallel Computing Toolbox (MathWorks 2014) applied to CPU calculations, without GPU optimization. The run time ratio we arrived at was between 400, for a very small radius, and 10, for radius sizes of approximately 20 pixels and above. This dependency on the radius size is a result of our selection of the dilution parameter as a factor of the radius size—as the radius size increases, the number of actual calculations performed decreases.

Despite the increase in run time relative to the PWP3D, our algorithm's run time may be significantly improved by parallelizing the histogram calculation within each local region. Additionally, using local appearance models requires estimating the region statistics only over a smaller portion of the image, rather than over the entire image. By using appropriate hardware (e.g., a multi-processor GPU) and software language we estimate our algorithm would achieve run time performance similar to the PWP3D.

Recently, Prisacariu et al. (2013) presented a framework for joint 3D tracking and reconstruction on a mobile phone. The tracking framework is based on the global PWP3D of Prisacariu and Reid (2012); however, by applying several optimizations they are able to present an algorithm suitable for a mobile phone. The run time optimization is applied to the three most costly procedures: (i) rendering—the projection of the 3D object onto the image is performed using a hierarchical binary rendering scheme; (ii) efficient calculation of the signed distance transform (SDF) derivatives—the derivatives of the SDF \(\varPhi \) are approximated only in a narrow band around the object's edge instead of over the entire object; (iii) optimization—they apply the Levenberg-Marquardt algorithm to find the optimal pose parameters from the energy function gradients. These steps are applicable to our framework as well, in order to further improve its run time.

4.3 Conditional Probabilities Estimation

The conditional PDFs are estimated by calculating the histograms (256 bins) of each color channel and smoothing them using a Gaussian kernel. For simplicity we assume the RGB channels are independent, therefore:

$$\begin{aligned} P\left( \varvec{y}_{RGB}\mid M\right) =P\left( y_{R}\mid M\right) P\left( y_{G}\mid M\right) P\left( y_{B}\mid M\right) \end{aligned}$$
(17)

However, using more realistic color models could be considered for better segmentation.
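
A minimal sketch of this estimation, assuming a uint8 image; the smoothing width sigma here is illustrative, not the value used in our experiments:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def channel_pdf(values, sigma=2.0):
    """Smoothed 256-bin histogram of one color channel, normalized to a PDF."""
    hist, _ = np.histogram(values, bins=256, range=(0, 256))
    hist = gaussian_filter1d(hist.astype(float), sigma)    # Gaussian smoothing
    return hist / max(hist.sum(), 1e-12)

def rgb_likelihood(pixels, pdf_r, pdf_g, pdf_b):
    """P(y_RGB | M) under the channel independence of Eq. (17);
    pixels is an (..., 3) uint8 array."""
    return (pdf_r[pixels[..., 0]] * pdf_g[pixels[..., 1]]
            * pdf_b[pixels[..., 2]])
```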

4.4 Optimization

In order to find the pose parameters which minimize the energy function we apply a simple first order gradient based optimization scheme. This iterative scheme requires finding the optimal step size in every iteration. We start by normalizing the rotation gradient and the translation gradient, each to unit pixel and unit rotation size. We restrict the step sizes such that all translation components share one step size and all rotation components share another. This is required in order to keep the gradient of the rotation vector in the correct direction. The result is a 2D search for the optimal rotation and translation step sizes, where the selected values are those which achieve the minimal energy. We employ a coarse-to-fine search method: we reduce or increase the search region as the optimal step size decreases or increases. A sketch of a single search iteration follows.
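
The following is a minimal sketch of one iteration of this 2D search; the grid of candidate step sizes and the shrink/grow factors are illustrative choices:

```python
import numpy as np
from itertools import product

def search_step_sizes(energy, lam, g_t, g_r, s_t, s_r, shrink=0.5, grow=2.0):
    """One iteration of the 2D step-size search.

    energy: callable evaluating E(lambda); g_t, g_r: normalized translation
    and rotation gradients (3-vectors); s_t, s_r: current step sizes.
    Returns the candidate pair achieving the minimal energy.
    """
    candidates = [(a * s_t, b * s_r)
                  for a, b in product((shrink, 1.0, grow), repeat=2)]
    def trial(st, sr):
        lam_new = np.asarray(lam, dtype=float).copy()
        lam_new[:3] -= st * g_t                    # shared translation step
        lam_new[3:] -= sr * g_r                    # shared rotation step
        return energy(lam_new)
    return min(candidates, key=lambda c: trial(*c))
```

Repeating the search with the returned pair shrinks or enlarges the search region as the optimal step size decreases or increases.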

Fig. 16 Illustration of the maximal offsets of the rotation, translation and scale parameters where the algorithm may estimate the correct pose

5 Results

In order to demonstrate the strengths of our method, we performed a series of experiments comparing the basin of attraction of our algorithm (labeled local PWP3D) and the original PWP3D algorithm (Prisacariu and Reid 2012) (labeled global PWP3D). The basin of attraction is defined in dynamical systems as the set of initial conditions leading to a given long-term behavior; in our case, it measures the range of initial pose errors (angular or translational) for which the algorithm converges to the correct pose. Based on the basin size, a sampling scheme of initial guesses can be constructed in order to reliably estimate the pose of an object. We performed these experiments using the ACCV 2012 database of Hinterstoisser et al. (2012). The database comprises 15 different objects, with different levels of heterogeneity, in a highly cluttered background. We selected a representative subset of objects (Fig. 17), which we divided into homogeneous and heterogeneous, and performed the following experiments:

(a) Rotation angle basin of attraction—in this experiment the object's pose was initialized to some erroneous rotation around its center of mass, and we measured the probability of convergence to the correct pose. The axis of rotation is selected randomly, hence the rotation value is unsigned. The results of this experiment for each object are shown in the second column of Fig. 17. We illustrate the edge of the convergence region for the driller object (30 degrees) in Fig. 16a.

(b) Translation parameters basin of attraction in the X and Y directions—we performed a similar experiment to compare the translation parameters basin of attraction. In this experiment we initialized the object's pose to some erroneous translation in the x and y directions and measured the probability of estimating the correct pose. The results of this experiment for each object are shown in the third column of Fig. 17. We illustrate the edge of the convergence region for the ape object (60 %) in Fig. 16b.

(c) Scale parameter basin of attraction, by setting the Z offset—in this experiment we measured the basin of attraction of the scale parameter by selecting offsets in the Z axis and measuring the probability of estimating the correct pose. The results of this experiment for each object are shown in the fourth column of Fig. 17. We illustrate the edge of the convergence region for the lamp object (40 %) in Fig. 16c.

Fig. 17 Performance analysis: probability of estimating the correct pose of each object. Homogeneous objects: Ape, Duck, Cat, Iron; heterogeneous objects: Driller, Glue, Lamp, Phone

Fig. 18 Rotation angle basin of attraction for various local region radius sizes. R is the radius size in pixels of the local region

The success criterion defined in these experiments was a rotation error of at most 10 degrees and a translation error of at most 10% of object size. In all experiments we set the local region radius to 30 pixels for all objects, despite their variability in size and shape. The consistent results achieved across the various objects, despite the non-optimal radius, indicate good robustness to the radius size. The results of all three experiments show similar trends: for fairly homogeneous objects (Ape, Cat, Duck, Iron) the performance of the local PWP3D and the global PWP3D is very similar—the global appearance models are sufficient to describe the object. However, for heterogeneous models (Driller, Phone, Lamp, Glue) our algorithm shows a significant improvement. Some of the objects in the heterogeneous group are more heterogeneous (e.g., glue, phone, lamp) than others (driller), and thus a more severe degradation is observed for them under the global model. We emphasize that the differences in performance are independent of the optimization algorithm selected. This is depicted by the results at a rotation angle of 0 degrees: an incorrect pose estimated at an initial error of 0 degrees indicates a minimum whose energy is below that of the ground truth. Hence, using the global model for a heterogeneous object, the optimal segmentation does not correspond to the correct pose.

5.1 Impact of Local Region Radius Size on Rotation Angle Basin of Attraction

In this section we present results of the rotation angle basin of attraction for various radius sizes. The experiment extends the rotation angle basin of attraction experiment to various local region radius sizes. We performed this experiment on the homogeneous ape model and the heterogeneous driller model. As discussed in Sect. 3, the radius size selection is a trade-off: selecting too large a radius does not properly capture the local statistics, whereas selecting too small a radius leads to over-fitting. This is apparent in the results presented in Fig. 18.

  • Ape—the basin of attraction of the ape model remains unaffected for most radius selections, with the exception of the 10 pixel radius. This behavior is due to the homogeneity of the object, which does not require a more complex model. The exceptional case, with a radius of 10 pixels, shows good performance for small initial errors which degrades as the initial error grows; this behavior is typical of over-fitting.

  • Driller—for this object we observe three different behaviors:

    1. \( R = 10 \)—we observe over-fitting once again.

    2. \( 10 < R < 100 \)—for this range of radii the localized algorithm behaves fairly well.

    3. \( R > 100 \)—for this range of radii, which includes the global PWP3D, the performance degrades, as the model cannot capture the spatial variability of the driller object.

6 Conclusions

In this manuscript we have presented a novel framework for simultaneously estimating the 3D pose of an object and segmenting the 2D image using a localized region based approach. Inspired by ideas from local active contours, we extend the PWP3D algorithm such that the segmentation is performed using local region statistics rather than global region statistics. This crucial difference allows us to extend the PWP3D algorithm to a new domain of objects, which are not homogeneous. We formulate our extension by defining multiple local energy functions, measuring the segmentation within each local region, and fusing them into a single energy function measuring the overall segmentation quality. We then derive the gradients of the energy function with respect to the pose parameters. We experimented with our localized region based framework, comparing it with the recent PWP3D, and showed a dramatic improvement for heterogeneous objects. Furthermore, we show a considerable improvement in performance for a wide range of local region radius sizes. The measured basin of attraction indicates our algorithm could be suitable for pose estimation schemes, and not only for 3D tracking, where only a narrow basin of attraction is required due to the small frame-to-frame variation.