INTRODUCTION

Motion capture techniques are used over a very broad field of applications, ranging from digital animation for entertainment to biomechanical analysis for sport and clinical applications. Sport and clinical applications require excellent accuracy and robustness. Two other major requirements for clinical applications are to minimize the time for patient preparation and the inter-observer variability. At present, using reflective markers on the skin is the most common technique.5 , 12 , 13 Despite their precision and popularity, marker based methods have several limitations: (i) markers attached to the subject can influence the subject's movement, (ii) a controlled environment is required to acquire high-quality data, (iii) the time required for marker placement can be excessive, and (iv) the markers on the skin can move relative to the underlying bone, leading to what is commonly called skin artifact.4 , 14, 32 Several recent review articles have summarized the common shortfalls of skin based marker techniques.6 , 8 , 21

Markerless motion capture offers an attractive solution to the problems associated with marker based methods. However, the use of markerless methods to capture human movement for biomechanical or clinical applications has been limited by the complexity of acquiring accurate three-dimensional kinematics using a markerless approach. The general problem of estimating the free motion of the human body or more generally of an object without markers, from multiple camera views, is underconstrained without the spatial and temporal correspondence that tracked markers guarantee.

Model based approaches provide methods to address some of the complexities associated with a markerless approach. An a priori model of the subject, for example, can be used to strongly reduce the total number of degrees of freedom of the problem. Another option is to increase the number of cameras so that more measured data is available to solve for a given number of degrees of freedom. Thus the robustness of a markerless approach can be increased by increasing the number of cameras and by limiting the search space of possible body configurations to anatomically appropriate ones. This last strategy can be pursued by using a human model to identify the motion of the subject.

Several model based methods have been proposed in the past, modeling the human body or parts of it with rigid1, 14, 9, 15, 18, 29, 30 or non rigid segments.19 However, these approaches have problems with accurate identification of three dimensional kinematics of the segments or use a limited number of body segments.7 Regarding the mathematical formulation of the model joints, exponential maps3 offer several advantages over previous approaches by simplifying the estimation of model pose and leading to robust identification of body kinematics.

Another important consideration in choosing an approach for markerless motion capture is the formulation of the cost function used to match the representation of the subject (2D silhouette, 3D visual hull, 2D features, etc.) to the model. While approaches utilizing only 2D information have been used,3 most biomechanical applications require a 3D model. In our approach, we built the subject's 3D representation using a shape-from-silhouette technique. Our group has developed several methods in the past to reconstruct the outer surface of the body.10 , 26 In this study the 3D representation has been obtained using the algorithm described in.26

The other important consideration in designing an approach involves the choice of an optimization algorithm that will successfully minimize the cost function to allow the calculation of the subject's kinematics. This optimization is difficult because the cost function has many local minima and the search space has very high dimensionality, but it can be accomplished through the simulated annealing matching algorithm running on an exponential maps geometry formulation. Simulated annealing is a statistical computational method based on Boltzmann sampling and the Metropolis Monte Carlo method.24 In standard Monte Carlo simulation, the state of a system is randomly changed by sampling the search space. All such changes are accepted, whether or not the new state results in a reduction of the cost function. Thus the system may be in a high-cost state much of the time, and the simulation has to run for a very long time to properly sample all low-cost regions of the search space. The modification made by Metropolis and co-workers is that each Monte Carlo step is accepted only probabilistically. This modified algorithm is known as Metropolis Monte Carlo simulation, and it further evolved into the method known as simulated annealing. Simulated annealing has the capability to identify states defined by many degrees of freedom while consistently reducing the risk of getting trapped in local minima. Thus the integration of a method that includes a subject's model, visual hulls and simulated annealing as described above offers a potential framework for a markerless motion capture system.

The purpose of this study was to describe the development and to validate such a markerless motion capture system. The goals of this system are to have the advantage of not requiring the placement of markers or the design of an acquisition protocol, and to be potentially usable in any possible environment where a large number of synchronized calibrated cameras are available. From these multiple views the geometric representation of the human body is reconstructed based on a visual hull concept, which was first described in.20 Simulated annealing is used to match an a priori 3D model to the visual hull, and the subsequently calculated kinematics of the matched model are validated against ground truth in a virtual environment.

FIGURE 1.

Visual hull reconstruction concept. The silhouettes of the subject from different camera planes are back projected in space. Their intersection generates the visual hull, a locally convex over-approximation of the volume occupied by the subject's body.

FIGURE 2.

(a) Poser model in the reference pose, (b) 33 DOF full body model. Points in highly deformable regions have been removed, as most clearly seen at the hips.

FIGURE 3.

Results of the matching algorithm (colored points) applied to the virtual environment sequence superimposed over the virtual character (gray surface).

FIGURE 4.

Results of the matching algorithm applied to the virtual environment sequence. Visual hull points (top, in blue) and matched model (bottom) for a gait cycle, in the sagittal plane.

METHODS

The developed method, described in the following sections, is used to track the motion of a human subject in both a virtual and a real environment. Sixteen cameras in the virtual environment and eight cameras in the experimental setup (all with the same resolution of 640 by 480 pixels) were used to obtain the subject's 3D representation. Tracking the 3D representation using the described matching algorithm leads to the extraction of the subject's kinematics. The following subsections describe the steps required to achieve this goal.

Reconstruction of the Subject's Visual Hull

The visual hull of an object, first extensively described in,19 can be defined as the locally convex (over) approximation of the volume occupied by the object. The 3D representation of the motion of the subject across the motion capture volume consists of one visual hull for each instant in time captured by the camera system. The visual hull construction process, diagrammed in Fig. 1, consists of the projection of the subject's silhouette from each of the camera planes back to the 3D volume. The intersection of the resulting cones in 3D space generates the subject's visual hull. The 2D silhouettes in the camera planes were obtained by foreground/background separation for every captured frame. In general, previously described shape-from-silhouette methods reconstruct the subject's visual hull by dividing the 3D space into cubic voxels whose size is inversely proportional to the desired resolution.11 , 18 , 23 , 28 , 33 The method used in the present work belongs to the same family of algorithms and its detailed description can be found in.26 , 28 , 33 An extensive study on the influence of camera number, resolution and placement on reconstructed visual hull quality can be found in.26 Its applicability in the typical in vivo experimental setup of a gait analysis laboratory was demonstrated.
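The voxel-based construction described above can be illustrated with a minimal sketch. It is not the algorithm of the cited references: for brevity it assumes three hypothetical orthographic cameras looking along the coordinate axes and a unit sphere as the subject, so each silhouette is a unit disk. The names `visual_hull`, `in_disk` and `cameras` are illustrative only. A real system would use calibrated perspective cameras and silhouettes obtained by foreground/background separation.

```python
from itertools import product

def visual_hull(voxel_centers, cameras):
    """Keep the voxels whose projection falls inside every silhouette."""
    return [v for v in voxel_centers
            if all(inside(project(v)) for project, inside in cameras)]

def in_disk(uv):
    """Toy silhouette test: a unit disk in the image plane."""
    return uv[0] ** 2 + uv[1] ** 2 <= 1.0

# Assumed toy cameras: (projection function, silhouette membership test).
cameras = [
    (lambda v: (v[1], v[2]), in_disk),  # looking along x
    (lambda v: (v[0], v[2]), in_disk),  # looking along y
    (lambda v: (v[0], v[1]), in_disk),  # looking along z
]

# Regular grid of voxel centers covering the cube [-1, 1]^3.
step = 0.1
axis = [round(-1.0 + step * i, 10) for i in range(21)]
grid = list(product(axis, repeat=3))

hull = visual_hull(grid, cameras)
sphere = [v for v in grid if v[0] ** 2 + v[1] ** 2 + v[2] ** 2 <= 1.0]

# The hull contains the true object but is larger (an over-approximation).
print(len(sphere), len(hull))
```

In this toy case the hull is the intersection of three cylinders, which encloses the sphere and keeps extra voxels near the cube diagonals. This mirrors the over-approximation property stated in the definition of the visual hull.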

Exponential Maps Formulation

The adopted exponential maps formulation27 guarantees a simple linear representation of the motion with the posture uniquely defined, avoiding the nonlinearities and singularities common to the Euler angles formulation. The exponential map for a twist describes the relative motion of two coordinate frames in space, and in the same way, every rigid transformation can be represented by a combination of twists. The exponential map formulation allows the representation of multiple twists by simply multiplying the exponentials of the transformation matrices together (1).

For example, for the lower limb, the rigid movement of the thigh with respect to the pelvis can be written as a single screw transformation (also known as a finite helical axis transformation) or as a combination of three twists. Going from the pelvis (defined in Eq. (1) as segment a) all the way down to the foot (defined in Eq. (1) as segment b), the final position of a point on the foot can be expressed as a function of the articular parameters of hip, knee and ankle joints through the multiplication of twists, as shown in Eq. (1). The pelvic coordinate system is in this case the reference.

$$ g_{ab} (\vartheta ) = g_{ab} (0)\prod\limits_{i = 1}^n {e^{\hat \xi _i \vartheta _i } } $$
(1)

Here, \( \vartheta = (\vartheta_1, \ldots, \vartheta_8) \) is the state vector (scalar angles) for a kinematic chain with eight degrees of freedom and n is the number of degrees of freedom, in this case equal to 8. \( (\vartheta_1, \vartheta_2, \vartheta_3) \) represent the three rotational degrees of freedom of the thigh with respect to the pelvis, \( (\vartheta_4, \vartheta_5, \vartheta_6) \) represent the three rotational degrees of freedom of the shank with respect to the thigh, and \( (\vartheta_7, \vartheta_8) \) represent the two rotational degrees of freedom of the foot with respect to the shank. \( g_{ab}(0) \) represents the transformation matrix from the a to the b coordinate frame in the initial configuration, i.e. with all state variables equal to zero. The general twist matrices \( \hat \xi_i \) are defined for each joint relative to the parent segment, following the formulation described in.27
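The product of exponentials in Eq. (1) can be sketched for a simplified chain. The example below is only an illustration under stated assumptions: it uses two hinge joints (a "hip" and a "knee", both rotating about the z axis) instead of the eight degrees of freedom above, takes \( g_{ab}(0) \) as the identity, and expresses each twist in the reference frame at the zero configuration. Joint positions and the helper names (`twist_exp`, `forward`, `apply`) are hypothetical.

```python
import math

def rotation(axis, theta):
    """Rodrigues' formula: 3x3 rotation by theta about a unit axis."""
    x, y, z = axis
    c, s = math.cos(theta), math.sin(theta)
    C = 1.0 - c
    return [[c + x * x * C,     x * y * C - z * s, x * z * C + y * s],
            [y * x * C + z * s, c + y * y * C,     y * z * C - x * s],
            [z * x * C - y * s, z * y * C + x * s, c + z * z * C]]

def twist_exp(axis, point, theta):
    """4x4 exponential of a revolute twist: rotation by theta about the
    line through `point` with unit direction `axis`."""
    R = rotation(axis, theta)
    Rq = [sum(R[i][j] * point[j] for j in range(3)) for i in range(3)]
    t = [point[i] - Rq[i] for i in range(3)]  # translation part (I - R) q
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward(g0, joints, thetas):
    """g(theta) = g(0) * exp(xi_1 th_1) * ... * exp(xi_n th_n)."""
    g = g0
    for (axis, point), theta in zip(joints, thetas):
        g = matmul(g, twist_exp(axis, point, theta))
    return g

def apply(g, p):
    """Transform a 3D point by a 4x4 homogeneous matrix."""
    h = list(p) + [1.0]
    return tuple(sum(g[i][j] * h[j] for j in range(4)) for i in range(3))

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z = (0.0, 0.0, 1.0)
joints = [(Z, (0.0, 0.0, 0.0)),    # assumed hip axis through the origin
          (Z, (0.0, -0.4, 0.0))]   # assumed knee axis 0.4 below the hip
foot = (0.0, -0.8, 0.0)            # point on the distal segment, zero pose

# Rotate hip and knee by 90 degrees each.
print(apply(forward(IDENTITY, joints, [math.pi / 2, math.pi / 2]), foot))
```

With both angles at 90 degrees the thigh points along +x and the shank along +y, so the foot point ends up at approximately (0.4, 0.4, 0). Each joint contributes one scalar parameter, and the pose is obtained by plain matrix multiplication.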

The final transformation from one body configuration to another one is given by a matrix T defined as follows:

$$ K'_{p \times 4N} = K_{p \times 4N} T = K_{p \times 4N} \left[ {\begin{array}{*{20}c}{g_1 } & 0 & {...} & 0 \\[5pt]0 & {g_2 } & {...} & 0 \\[5pt]{...} & {...} & {g_i } & {...} \\[5pt]0 & 0 & {...} & {g_N } \\\end{array}} \right]_{4N \times 4N} $$
(2)

where \( g_i \) is the 4×4 matrix representing the generic rigid transformation for segment i with respect to the parent segment. T is a 4N×4N square matrix in which N is the total number of segments in the human body model. K matrices contain the p visual hull points in homogeneous coordinates.

Full Body Model

The model contains morphological information (a surface with 1600 points) and kinematic information about how the model can move. The morphological information came from a reference pose (Fig. 2a). The model was then segmented into the parts corresponding to the 12 main anatomical segments shown in Fig. 2b: pelvis, thighs, shanks, feet, arms, forearms, and combined torso and head. For the real application with human subjects, the morphological information was obtained from a laser scan of the subject, providing an accurate description of the body's outer surface that was then manually segmented.

FIGURE 5.

Comparison between ground truth provided by the virtual environment model and the results from the matching algorithm (gray shaded area indicates ± one standard deviation) for (a) knee flexion, (b) knee adduction, (c) hip flexion, and (d) hip adduction.

FIGURE 6.

Comparison between ground truth provided by the virtual environment model and the results from the matching algorithm (gray shaded area indicates ± one standard deviation) for (a) ankle dorsiflexion, (b) ankle inversion, (c) shoulder flexion, and (d) shoulder abduction.

The kinematic model (depicted by the lines connecting joints in Fig. 2b) includes the full body and has 33 degrees of freedom (DOF). Joints were modeled as ball-socket joints or as simple hinge joints. In particular, for the lower limbs, the hip and knee were modeled as spherical joints with three degrees of freedom in rotation (flexion-extension, adduction-abduction, internal-external rotation), while the ankle was modeled as a double hinge joint having two rotational DOF (plantar-dorsi flexion, in-eversion). For the upper body the movement between the torso and the pelvis was modeled as a simple hinge-joint with one rotational DOF (flexion at the 5th lumbar), the shoulder was modeled as a spherical joint (flexion-extension, internal-external rotation, adduction-abduction) and the elbow was modeled as a double hinge joint having two rotational DOF (flexion-extension and pronation-supination). The remaining six degrees of freedom described the rigid body translation and rotation in space of the root segment, the pelvis. The geometrical formulation of the model is open in the sense that any joint model can be modified independently without the need to readjust the others. More complex joint models may take into account both rotational and translational behavior using the same mathematical structure, by using the appropriate formulation of a particular joint that allows the translation along the particular twist axis. The completed model was created by rigidly joining the morphological representations of each segment (each a set of points describing a surface) to the corresponding rigid segments of the underlying kinematic model. Motion was constrained to anatomically consistent ranges.
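As a bookkeeping check on the joint model just described, the degrees of freedom can be tallied per joint type. The dictionary layout and names below are illustrative assumptions; only the per-joint DOF counts come from the text.

```python
# (degrees of freedom per joint, number of such joints in the body)
JOINT_DOF = {
    "root (pelvis translation + rotation)": (6, 1),
    "hip": (3, 2),           # spherical joint, left and right
    "knee": (3, 2),          # spherical joint
    "ankle": (2, 2),         # double hinge
    "torso-pelvis": (1, 1),  # simple hinge
    "shoulder": (3, 2),      # spherical joint
    "elbow": (2, 2),         # double hinge
}

total_dof = sum(dof * count for dof, count in JOINT_DOF.values())
n_joints = sum(count for name, (dof, count) in JOINT_DOF.items()
               if not name.startswith("root"))

print(total_dof, n_joints)
```

The tally gives 33 degrees of freedom over 11 joints, consistent with a 12-segment model in which every segment except the root is attached to a parent by one joint.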

Surface points close to the joints in the model were removed to minimize the influence of tissue deformation that occurs around the joints during movement.

Matching Process by Simulated Annealing

The matching process consisted of the minimization of a cost function in a continuous domain describing the quality of matching between the model and the visual hull cloud of points (made of about 2500 points). This matching was done for each time frame in order to identify the whole motion of the subject. Since all degrees of freedom were matched simultaneously the search space was 33-dimensional (number of DOF in the kinematic model). Gradient-based methods were not appropriate to solve such a high-dimensional problem due to the large number of local minima in which the algorithm could get trapped. Instead, we adopted a stochastic approach called simulated annealing that is an extension of the original Metropolis Monte Carlo method. This class of methods has been refined over the last decades and has the capability of climbing out of local minima until the desired matching accuracy is achieved.16 , 17 , 31 , 34

Simulated Annealing

The implemented simulated annealing method uses the acceptance function (3) proposed by Metropolis,24 which is a function of the parameter T and of the value of the cost function f. The parameter T, commonly called temperature due to the analogy of the optimization process with the chemical process of annealing, is a function that decreases as the iteration number increases.

$$ A(x,y,T) = \min \left\{ {1,{\rm e}^{ - \frac{{f_y - f_x }}{T}} } \right\} $$
(3)

Moving from the current state \( x_i \) to the next state \( x_{i+1} \), the step is accepted or rejected according to (4), where p is sampled from a uniform distribution on [0, 1] and the value \( k_{i+1} \) is sampled from a chosen distribution (see next paragraph).

$$ x_{i + 1} = \left\{ {\begin{array}{ll} y_{i + 1} = x_i + k_{i + 1} & {\rm if}\; p \le A(x_i ,y_{i + 1} ,T_i ) \\ x_i & {\rm otherwise} \end{array}} \right. $$
(4)

Since the parameter T plays an important role in the acceptance function, several authors have proposed different formulations for its decreasing function (cooling schedule) and the corresponding sampling distribution for \( k_{i+1} \), in order to improve the performance of the algorithm, which normally has a high computational cost.16 , 17 , 24 , 31 , 34 The formulation used in this work is described in,34 and samples \( k_{i+1} \) from a Cauchy distribution. Sampling in this way allows the algorithm to visit each region with positive Lebesgue measure infinitely often when a cooling schedule proportional to \( T_0/i \) is adopted, where \( T_0 \) is a large enough constant and i is the iteration number. To assure better capabilities for climbing out of local minima (as demonstrated in simulated trials10), in this work the parameter T is not a function of the iteration number alone but depends also on the value of the cost function. An extensive and complete description of the general simulated annealing method can be found in.22
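The ingredients described in this section (Metropolis acceptance, Cauchy-distributed steps and a cooling schedule proportional to \( T_0/i \)) can be combined in a short sketch. This is not the implementation used in this work: it uses the standard Metropolis rule for minimization, in which a cost increase from \( f_x \) to \( f_y \) is accepted with probability \( e^{-(f_y - f_x)/T} \), and a temperature that depends on the iteration number only, without the dependence on the cost value mentioned above. The toy cost `bumpy`, the step `scale` and all default parameter values are assumptions chosen for illustration.

```python
import math
import random

def acceptance(f_x, f_y, T):
    """Metropolis acceptance probability for a move from cost f_x to f_y."""
    if f_y <= f_x:
        return 1.0                      # improvements are always accepted
    return math.exp(-(f_y - f_x) / T)   # uphill moves accepted sometimes

def cauchy_step(dim, scale, rng):
    """Draw one Cauchy-distributed increment per dimension."""
    return [scale * math.tan(math.pi * (rng.random() - 0.5))
            for _ in range(dim)]

def anneal(cost, x0, T0=10.0, scale=0.1, iterations=5000, seed=0):
    """Simulated annealing with cooling T_i = T0 / i (assumed schedule)."""
    rng = random.Random(seed)
    x, f_x = list(x0), cost(x0)
    best, f_best = list(x), f_x
    for i in range(1, iterations + 1):
        T = T0 / i
        k = cauchy_step(len(x), scale, rng)
        y = [xi + ki for xi, ki in zip(x, k)]
        f_y = cost(y)
        if rng.random() <= acceptance(f_x, f_y, T):
            x, f_x = y, f_y
            if f_x < f_best:
                best, f_best = list(x), f_x
    return best, f_best

def bumpy(x):
    """Toy cost with many local minima and a global minimum of 0 at 0."""
    return sum(xi * xi + 2.0 * (1.0 - math.cos(5.0 * xi)) for xi in x)

start = [2.0, -1.5]
best, f_best = anneal(bumpy, start)
print(bumpy(start), f_best)
```

The heavy tails of the Cauchy distribution occasionally produce long jumps, which is what lets the search leave a basin even at low temperature. In the tracking problem the state would be the 33-dimensional pose vector and `cost` the point cloud distance between model and visual hull.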

The Cost Function

An appropriate choice of the cost function is one of the core requirements for successful and robust matching. Two clouds of points need to be matched, one that is articulated using a kinematic model and the other coming from visual hull reconstruction. The latter has in general a non constant number of points through the sequence frames, and there is no correspondence between points in different time frames even though they are equally spaced in a 3D voxel structure. The chosen cost function for this work (5) was a variation on the Hausdorff distance and has been shown to be very robust even if computationally demanding.

$$ {\rm COST}(A,B) = \sum\limits_{a \in A} {\min\limits_{b \in B} \left\| {a - b} \right\|} $$
(5)

The cost function used here differs from the original formulation of the Hausdorff distance since it sums every single contribution, instead of taking just the maximum between the minimal distances between pairs of points. This modification increases robustness to possible outliers. As is the case for the original Hausdorff distance, this cost function is not commutative. Intuitively, one could state that a low value of COST(A, B) guarantees that all the points of set A are not very far from their closest point of set B. However, it does not guarantee that all points of set B are not very far from their closest point of set A. In the first frame of the sequence, the visual hull points are set A, while the model points are set B, since the two sets may not be close to each other (visual hull-to-model formulation). For subsequent frames the next visual hull frame is always very close to the previous matched model state, so the cost function is changed to the model-to-visual hull formulation, which guarantees better accuracy because it is less sensitive to phantom volumes in the visual hull. Phantom volumes are defined as a large local deviation from the real subject's outer body surface resulting from the use of too few cameras. In our case phantom volumes generate points of set B (visual hull) far from their closest point of set A (model) that are neglected since the cost function—based on COST(A, B)—only accounts for the distance between points of set A from their closest point of set B (model-to-visual hull).
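The asymmetry of the cost function and its effect on phantom volumes can be seen in a small brute-force sketch. The point sets are made up for illustration, and the quadratic-time search is an assumption of convenience (a spatial index would normally be used for clouds of a few thousand points). The function `directed_hausdorff` is included only for contrast with the original Hausdorff formulation.

```python
import math

def cost(A, B):
    """Sum over a in A of the distance from a to its nearest point in B."""
    return sum(min(math.dist(a, b) for b in B) for a in A)

def directed_hausdorff(A, B):
    """Original directed Hausdorff distance: the largest nearest distance."""
    return max(min(math.dist(a, b) for b in B) for a in A)

# Toy data: the visual hull equals the model plus one phantom point.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
hull = model + [(5.0, 0.0, 0.0)]

print(cost(model, hull))  # model-to-visual hull: phantom point ignored
print(cost(hull, model))  # visual hull-to-model: phantom point penalized
```

Here the model-to-visual hull cost is 0 because every model point has an exact match, while the visual hull-to-model cost is 4 because the phantom point lies 4 units from its nearest model point. This is the behavior that makes the model-to-visual hull direction less sensitive to phantom volumes.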

Motion Data: Virtual Environment

In order to provide data with a real ground truth, a virtual character was animated with known kinematics using Poser® software (Curious Labs, CA, USA). In the virtual sequence a male subject walks along a straight line mimicking a gait analysis sequence. Since the animation software uses an Euler angle formulation, the internal-external rotations of each joint were set to zero to avoid cross talk between rotations about different axes. Sixteen virtual cameras were uniformly distributed in a most favorable hemispherical configuration26 around the virtual character. Images from each camera were taken at every frame of the gait sequence. Silhouettes were extracted from each camera image and then processed to create the visual hulls that feed the matching algorithm presented in the previous sections.

Motion Data: Experimental

To demonstrate the effectiveness and potential of the method for biomechanical applications, a running sequence of a human subject was captured using 8 color video cameras with a resolution of 640 by 480 pixels and a frame rate of 75 frames/s. A running sequence is more challenging than gait analysis for the tracking algorithm since it involves higher velocities and accelerations of the anatomical segments. The acquisition was done in a standard gait analysis laboratory environment, i.e. without altering the background or lighting conditions. The sequence was processed with the same algorithms described in this section. The subject's model was created using a 3D laser scan (Whole Body 3D scanner Model WBX by Cyberware, USA; accuracy within 1 mm and about 15 seconds scanning time). The 11 joint centers of the 33 degree of freedom model were manually identified on the model obtained from the laser scanner.

The described method was validated in a virtual environment and qualitatively tested in experimental conditions. Using the virtual environment permitted the evaluation of the accuracy of extracting human body kinematics while excluding errors due to experimental artifacts (e.g. due to camera calibration errors, errors in background subtraction, etc.), thus obtaining the true potential of the method with the given camera setup (16 cameras, 640×480 pixel resolution). A Kalman filter was used to smooth results and improve the quality of derivatives.

RESULTS

The motion obtained in the virtual environment for the model (colored points) and the original character compared favorably (Fig. 3). In Fig. 4 the visual hulls and the matched model in the sagittal plane are shown as point clouds. The joint angles for the walking sequence are known and are compared with the ones obtained from the matching algorithm. Motions at the hip, knee, ankle and shoulder show good agreement between the virtual character kinematics and the matched model results (Figs. 5 and 6). Moreover, the algorithm does not drift, as shown by the fact that the errors do not increase with frame number.

The errors for the hip, knee, ankle and shoulder joints through the entire gait sequence are reported in Table 1. Good results are obtained in terms of mean absolute errors for flexion and adduction of hip, knee and shoulder. Errors are slightly bigger for the ankle joint, mainly due to the poor ratio between camera resolution and dimensions of the foot. The high degree of axial symmetry of the thighs and shanks makes it difficult for the algorithm to track internal-external rotation, leading to lower accuracy. This symmetry results in mean errors of 3.9° (±4.1°) and 2.7° (±4.7°) for hip and knee internal-external rotations, respectively.

Table 1. Summary of the validation results for joint angles at the hip, knee, ankle and shoulder.

Unlike most other tracking algorithms, the presented method does not require accurate initialization of the model to match the first frame. A rough rigid body positioning (as shown in Fig. 7, left) of the model in a reference frame is enough to have a consistent matching of the first frame of the sequence (Fig. 7, right). The computational time for solving for the entire sequence in this non-optimized version is on the order of a few hours, since no specific hardware or optimized software has been used.

The experimental results for a running sequence of a male subject are presented in Fig. 8. The sequence was processed with the algorithms described in the methods section. The effectiveness of the tracking is apparent in the way the point clouds representing the different anatomic segments consistently overlay the visual hull of the subject (shown in gray).

FIGURE 7.

Matching of the first frame. The visual hull point cloud is shown in blue, while the different segments of the model are shown in other colors. The algorithm does not require a good initialization of the model in order to achieve the first matching (right).

FIGURE 8.

Result of the matching algorithm (color points) applied to an experimental data sequence (gray surface). On the bottom the corresponding video images of the running sequence from one camera are shown.

DISCUSSION

The proposed method has been quantitatively validated for several joints. An effective tracking capability has been shown even for smaller body segments like the feet, which are normally neglected by other approaches. The results with this data also demonstrate the robustness of the approach, since the performance of the matching process did not deteriorate with frame number, a common problem for most feature-tracking based approaches.1,2 Unlike in those approaches, a bad initial guess will only increase the computational time necessary to obtain the desired matching because in each frame the model is being matched to the absolute position of the visual hull rather than to the change in the video images from the previous frame to the current one.

In the kinematic model 33 degrees of freedom have been modeled, including 3 rotational degrees of freedom for the hip, knee and shoulder together with 2 degrees of freedom for the ankle (dorsi-plantarflexion, in-eversion). This set of degrees of freedom is appropriate for most biomechanical studies of the lower limb. In fact, having for example 3 DOF at the knee is important in order to investigate the secondary rotations, which are crucial to understanding injury and disease mechanisms. One critical aspect of developing a successful algorithm for biomechanical analysis of many different activities is the measurement of internal-external rotation of the hip and the knee, which is quite noisy due to the almost cylindrical symmetry of the thigh and shank. This aspect represents the main limitation of the presented method that remains to be addressed in the future. Nevertheless, an increase in camera resolution alone would provide more accurate 3D reconstruction of the subject.25 This increase would improve the overall tracking accuracy and reduce the problem of internal-external rotation of the hip and knee because the existing axial asymmetries in the thigh and shank would be better represented. This is one of the major challenges that will need to be addressed in the future since assessing true motions in the transverse plane can be very relevant from a biomechanical point of view. The described algorithm, being based on the entire shape instead of a small number of single points, shows good performance even with camera resolutions an order of magnitude lower than those of current stereophotogrammetric systems for marker-based motion analysis. This approach also offers a great potential for the reduction of skin artifact error. Instead of relying on just a few markers it is based on the tracking of a few hundred points per segment, naturally averaging the skin artifact phenomenon across the segment during the matching process.

Markerless motion capture also guarantees a great reduction of the amount of time for subject preparation compared to marker based methods since no time for marker placement is required. Moreover inter-operator variability is eliminated since no trained operator is needed to accurately place markers. On the other hand, the processing time is longer since the creation of the model is required. Future work must automate the setting up of the model and address how foreground/background separation and camera calibration affect the accuracy of the results, which have not been directly addressed in the virtual environment validation. However, the experimental data presented already show good qualitative results, demonstrating the effectiveness and potential of the method for use in a clinical environment such as a gait laboratory. Moreover, tracking a running sequence, which is in general more challenging than walking due to increased velocities of the subject, demonstrated the robustness of the method. Overall, in the authors’ opinion, the system can provide sufficient accuracy for biomechanical research and has great potential for future improvements, both in the creation of the model and in the matching algorithm. It is our intention to provide in the future an experimental comparison/validation against currently available techniques such as marker based methods.

Two further extremely valuable advantages of the presented algorithm are: (i) apart from a rough rigid body positioning, it does not need to be initialized at all, being able to go from a reference pose to the first frame pose consistently, as shown in Fig. 7; (ii) it directly provides joint centers and segment volume information during motion that can be used for a more accurate calculation of the subject's kinetics.