1 Introduction

Human motion capture using markers (active or passive) requires cameras to be positioned around the volume of interest so that at least two cameras can view each marker. Three-dimensional marker positions are then calculated by triangulation, and subject movements can thus be defined. When the movement is complex or is produced in the presence of obstacles, it is more difficult to capture markers. Consequently, a human operator tries to find camera configurations that minimize marker occlusion. This task remains difficult and time-consuming, requiring many validation tests even for an experienced operator.

Camera configuration has a critical impact on the overall motion capture performance. The quality of 3D marker reconstruction is strongly affected by two sources of error: marker visibility and triangulation accuracy. In order to address these error sources, the following points should be taken into consideration:

  1. A marker should not be occluded by any obstacle.

  2. A marker should be within the camera’s field of view.

  3. A marker must be visible given the camera’s resolution, the marker’s size, and its distance from the camera.

  4. At least two cameras are required to reconstruct a marker.

  5. Cameras should be arranged sufficiently non-parallel so that triangulation calculations are well conditioned.

  6. Incorporating additional cameras further over-determines the triangulation; however, it involves a higher cost.

To assist human operators in configuring camera networks and improving human motion capture, we developed a computer tool that uses a guided genetic algorithm (GGA) to simulate the best camera network configuration. Our approach captures tags (or markers) attached to a virtual humanoid performing movements similar to those in the real world. The goal is then to simulate the best camera network for a specified virtual scene.

2 Related work

The problem of finding optimal camera placement has been studied for a long time. The earliest research on the subject is based on the famous art gallery problem, whose main purpose is to find the minimum number of cameras that can monitor a fixed number of paintings in a gallery (Chvatal 1975). Initial formulations lacked realistic models for cameras and the environment (Lee and Lin 1986; O’rourke 1987): cameras were assumed to have a 360-degree field of view, with no resolution degradation with viewing distance, and both the target space and the camera locations were restricted to 2D. Mittal and Davis (2008) and Piciarelli et al. (2010) incorporated more realistic models of cameras and the environment and expanded the search space to 3D.

Unlike the previous techniques, which formulated the camera placement problem in 2D or 3D continuous spaces, most recent approaches consider the problem entirely in the discrete domain (Hörster and Lienhart 2006). Instead of optimizing a continuous function using a variational calculation, approaches in the discrete domain discretize the search space into a finite number of candidate positions and look for the configurations that optimize a given cost function. This strategy naturally leads to combinatorial problems. Efforts have been made to formulate discrete camera placement problems using linear binary programming (Gonzalez-Barbosa et al. 2009) and quadratic programming (Ercan et al. 2006).

Moreover, as the majority of these formulations lead to NP-hard problems, a multitude of practical solutions have been proposed: binary integer programming (Zhao et al. 2008), greedy approach (Al Hasan et al. 2008), Monte Carlo simulations (Yao et al. 2010), genetic algorithms (Olague and Mohr 2002) and particle swarm optimization (Morsly et al. 2012).

The high dimensionality of the search space and nonlinearity of the camera placement problem for motion capture led us to promote the use of genetic algorithms. In the following, we present research based on the use of genetic algorithms (GA) to solve the camera placement problem.

Olague and Mohr (2002) used a genetic algorithm to evaluate a set of camera network configurations based on horizontal and vertical camera orientation (\(\alpha\), \(\beta\)). They proposed a 3D reconstruction uncertainty measurement using a scalar function of the covariance matrix. This method considers the resolution constraint as the only cause of 3D uncertainty. Chen and Davis (2008) therefore built a quality metric that takes both resolution and occlusion into consideration. Rahimian and Kearney (2015) showed that their optimization algorithm achieves more significant performance gains than the method of Chen and Davis (2008), both in terms of the visibility metric and in the robustness of full-body motion capture. However, their method does not estimate the minimum number of cameras for a given accuracy threshold and workspace.

Indu et al. (2009) proposed an optimization of the camera network placement using a genetic algorithm based on the camera position parameters and the horizontal and vertical camera orientation (\(x_{{\mathrm{c}}}\), \(y_{{\mathrm{c}}}\), \(z_{{\mathrm{c}}}\), \(\alpha\), \(\beta\)). This study defines a coverage matrix by dividing the area to be photographed into a uniform grid, and each grid cell is classified as a priority area, a non-priority area, or an obstacle. This method becomes ineffective when the grid contains a large number of cells.

Katz et al. (2010) adopted the method of Olague and Mohr (2002), using the same 3D reconstruction uncertainty measurement but with all the camera position and orientation parameters (\(x_{{\mathrm{c}}}\), \(y_{{\mathrm{c}}}\), \(z_{{\mathrm{c}}}\), \(\alpha\), \(\beta\), \(\gamma\)), to find an optimal network placement. This method seems ineffective for the motion capture problem because it is limited to finding local minima and therefore does not always provide the right camera placement. In addition, the population size and the crossover and mutation rates must be adjusted manually by the user, which limits the effectiveness of the method.

Fig. 1 Proposed methodology to find preferential camera network placement for motion capture

The main elements that differentiate previous GA-based approaches are the number of camera parameters and the way these parameters are used to evaluate placement performance. Researchers have gradually increased the number of camera parameters in order to find adequate solutions that do not require manual intervention. However, these algorithms suffer from excessive computation time, since they handle several solutions simultaneously. Moreover, an incorrect choice of parameters such as the population size or the mutation rate may limit the effectiveness of the algorithm.

3 Proposed approach

Our camera placement assistance methodology is divided into two modules: graphical modelling and camera placement optimization (Fig. 1).

Graphical modelling includes scene creation, camera placement, and virtual humanoid animation. Animation involves enabling the virtual humanoid to produce movement similar to that in the real world in order to develop a specific scenario. This scenario can be recorded, modified, or rebuilt.

The camera placement optimization module applies the GGA to the created scenario in order to establish the best camera network placement with an appropriate number of cameras. This module is composed of three blocks: the distribution and estimation technique (DET) block restricts the search space, the error metric block evaluates the performance of a configured camera network, and the adaptive genetic algorithm block starts with the volume estimated by the DET and uses the error metric to optimize the initial camera networks.

3.1 Module 1: Graphical modelling

In this section, we define the graphical modelling that allows us to create a virtual scenario. This step is the starting point of the optimization process. It includes scene modelling, camera modelling, and human modelling.

3.1.1 Scene modelling

The scene represents the volume in which the cameras and the animated humanoid can be placed. The user specifies minimum and maximum values for the dimensions of the scene along the x, y, and z axes (Fig. 2).

3.1.2 Camera modelling

Cameras are modelled in 3D (Fig. 2). The location and view direction of the cameras in the scene are described by extrinsic parameters, including:

  • Three position coordinates: \((x_{{\mathrm{c}}}, y_{{\mathrm{c}}}, z_{{\mathrm{c}}})\).

  • Rotation matrix R with three angles: (\(\alpha\), \(\beta\), \(\gamma\)).

    $$\begin{aligned} R=R_x(\alpha )\times R_y(\beta )\times R_z(\gamma ) \end{aligned}$$
    (1)

    with

    $$\begin{aligned} R_x(\alpha )= \begin{pmatrix} 1 &\quad 0 &\quad 0\\ 0 &\quad \cos (\alpha ) &\quad -\sin (\alpha )\\ 0 &\quad \sin (\alpha ) &\quad \cos (\alpha ) \end{pmatrix} \end{aligned}$$
    (2)
    $$\begin{aligned} R_y(\beta )= \begin{pmatrix} \cos (\beta ) &\quad 0 &\quad \sin (\beta )\\ 0 &\quad 1 &\quad 0\\ -\sin (\beta ) &\quad 0 &\quad \cos (\beta ) \end{pmatrix} \end{aligned}$$
    (3)
    $$\begin{aligned} R_z(\gamma )= \begin{pmatrix} \cos (\gamma ) &\quad -\sin (\gamma ) &\quad 0\\ \sin (\gamma ) &\quad \cos (\gamma ) &\quad 0\\ 0 &\quad 0 &\quad 1 \end{pmatrix} \end{aligned}$$
    (4)
  • \(R_x\) rotates the y axis towards the z axis.

  • \(R_y\) rotates the z axis towards the x axis.

  • \(R_z\) rotates the x axis towards the y axis.
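
To make the camera model concrete, the following Python sketch composes the rotation matrix of Eqs. 1–4 from the three angles; the function name and the use of NumPy are illustrative assumptions, not part of the original tool.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Compose R = Rx(alpha) * Ry(beta) * Rz(gamma) as in Eqs. (1)-(4).

    Angles are given in radians.
    """
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)

    Rx = np.array([[1, 0, 0],
                   [0, ca, -sa],
                   [0, sa, ca]])
    Ry = np.array([[cb, 0, sb],
                   [0, 1, 0],
                   [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0],
                   [sg, cg, 0],
                   [0, 0, 1]])
    return Rx @ Ry @ Rz
```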

3.1.3 Human modelling

The human figure is represented as an H-Anim (LOA 1) model (Lee et al. 2015), in which each model node corresponds to a body part (e.g. legs, torso, arms). Tags are fixed to specific locations on the virtual humanoid according to their placement on the real actor (Fig. 2b, c). The virtual humanoid is animated by a sequence of frames. Each frame is displayed on the screen until the next frame overwrites it. This process creates a specific movement (e.g. walking, running, or jumping) that can be recorded, modified, or rebuilt.

Fig. 2 a Modelling a motion capture scene in 3D with a virtual humanoid and camera. The camera is placed according to the parameters \(x_{{\mathrm{c}}}\), \(y_{{\mathrm{c}}}\), \(z_{{\mathrm{c}}}\), \(\alpha\), \(\beta\) and \(\gamma\) in relation to scene landmarks. b, c Example of the positioning of the tags on the virtual humanoid body segments in accordance with what is planned in the real world: a graphical model, b front view of the humanoid, c back view of the humanoid

3.2 Module 2: Camera placement optimization

In the field of motion capture, the GGA improves the classical genetic algorithm at the initialization, evaluation, and reproduction levels to address the problems identified in previous studies.

The initialization step uses the DET to choose the initial population of camera networks.

In the evaluation step, our error metric estimates the visibility of the tags for each camera network and provides a fitness score. This fitness score is then used to assess the performance of the camera networks.

In the reproduction step, adaptive strategies are used to obtain the best crossover and mutation results.

We divided this camera placement optimization module into three blocks (Fig. 1): distribution and estimation technique, error metric, and adaptive genetic algorithm described in detail below.

3.2.1 Distribution and estimation technique

The DET computes the distribution of the markers during the movement to be analysed and proposes the best volume in which to place the initial camera networks (Fig. 3).

We define a \({{ tag}}_j(x_j, y_j, z_j)\) as a given 3D point in the scene’s frame, and at each instant t the 3D position of a set of tags is given by Eq. 5:

$$\begin{aligned} L(t)= \begin{bmatrix} {{ tag}}_1(t)\\ \vdots \\ {{ tag}}_j(t) \end{bmatrix} = \begin{bmatrix} x_1(t)\\ y_1(t)\\ z_1(t)\\ \vdots \\ x_j(t)\\ y_j(t)\\ z_j(t) \end{bmatrix} \end{aligned}$$
(5)

where L(t) is the position of all the tags at instant t.

\(x_j\), \(y_j\) and \(z_j\) are the coordinates of \({{ tag}}_j\forall j \in [1\ldots N]\), where N is the number of tags in the scene. \({{ tag}}_j(t)\) is the \(j{\hbox {th}}\) tag at instant t.

From Eq. 5, we define a movement as a position matrix:

$$\begin{aligned} \overbrace{ \begin{bmatrix} {{ tag}}_1(t_0)&\cdots&{{ tag}}_1(t_{{{ Nf}}})\\ {{ tag}}_2(t_0)&\cdots&{{ tag}}_2(t_{{{ Nf}}})\\ \vdots&\ddots&\vdots \\ {{ tag}}_j(t_0)&\cdots&{{ tag}}_j(t_{{{ Nf}}}) \end{bmatrix}}^{{{ movement}}} = \begin{bmatrix} x_1(t_0)&\cdots&x_1(t_{{{ Nf}}})\\ y_1(t_0)&\cdots&y_1(t_{{{ Nf}}})\\ z_1(t_0)&\cdots&z_1(t_{{{ Nf}}})\\ x_2(t_0)&\cdots&x_2(t_{{{ Nf}}})\\ y_2(t_0)&\cdots&y_2(t_{{{ Nf}}})\\ z_2(t_0)&\cdots&z_2(t_{{{ Nf}}})\\ \vdots&\ddots&\vdots \\ x_j(t_0)&\cdots&x_j(t_{{{ Nf}}})\\ y_j(t_0)&\cdots&y_j(t_{{{ Nf}}})\\ z_j(t_0)&\cdots&z_j(t_{{{ Nf}}}) \end{bmatrix} \end{aligned}$$
(6)

where \(t\in [0\ldots {{ Nf}}]\) and Nf is the number of frames.
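
For illustration, the position matrix of Eq. 6 can be assembled from per-frame tag coordinates as in the following Python sketch; the array layout chosen here is an assumption.

```python
import numpy as np

def movement_matrix(tag_positions):
    """Stack per-frame tag coordinates into the position matrix of Eq. (6).

    tag_positions: array of shape (Nf, N, 3) holding the (x, y, z)
    coordinates of the N tags over Nf frames.
    Returns an array of shape (3 * N, Nf): one column L(t) per frame (Eq. 5).
    """
    nf, n, _ = tag_positions.shape
    # For each frame, flatten (x1, y1, z1, ..., xN, yN, zN) into one column.
    return tag_positions.reshape(nf, 3 * n).T
```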

Based on Eq. 6, initial camera locations are generated using the following two steps:

Determining initial camera positioning zone The purpose of this step is to position the cameras in the proposed solution zone (Fig. 3). To define this zone, we first determine, from the point clouds, the overall volume of the tags’ movement. Then, we select the space in which the tags are visible given the camera’s resolution, the marker’s size, and its distance from the camera.

Determining camera orientation from an interest point In this step, we calculate the orientation parameters (\(\alpha\), \(\beta\), \(\gamma\)) according to the camera position (\(x_{{\mathrm{c}}}\), \(y_{{\mathrm{c}}}\), \(z_{{\mathrm{c}}}\)) and an interest point (IP), the target at which the camera lens is pointed. This interest point is located in the overall volume of the tags’ movement at (\(x_{{{ IP}}}\), \(y_{{{ IP}}}\), \(z_{{{ IP}}}\)) from the centre of the scene \(O_{{\mathrm{s}}}\), according to Eq. 7.

$$\begin{aligned} IP= \frac{1}{N} \sum \limits _{i=1}^{N} [[x_i, y_i, z_i]^{{\mathrm{T}}}]o_{{\mathrm{s}}}\Leftrightarrow \left\{ \begin{array}{ll} x_{{{ IP}}}= \frac{1}{N} \sum \limits _{i=1}^{N}x_i \\ y_{{{ IP}}}= \frac{1}{N} \sum \limits _{i=1}^{N}y_i \\ z_{{{ IP}}}= \frac{1}{N} \sum \limits _{i=1}^{N}z_i \end{array} \right. \end{aligned}$$
(7)

Then, we calculate the distances xside, yside, and zside between IP and the camera position along each axis as follows:

$$\begin{aligned}\displaystyle {xside}&=|x_{{\mathrm{c}}}-x_{{{ IP}}} | \\ \displaystyle {{ yside}}&= | y_{{\mathrm{c}}}-y_{{{ IP}}} | \\ \displaystyle {{ zside}}&= | z_{{\mathrm{c}}}-z_{{{ IP}}} |\end{aligned}$$
(8)

Using Eq. 8, we define d and r as the distances in 2D and 3D Euclidean space as follows:

$$\begin{aligned} d= & {} \sqrt{({ xside})^2 + ({{{ yside}}})^2} \end{aligned}$$
(9)
$$\begin{aligned} r= & {} \sqrt{d^2 + ({{{ zside}}})^2} \end{aligned}$$
(10)

Thus, according to the centre of the scene \(O_{{\mathrm{s}}}\), the camera directions \(\alpha\), \(\beta\) and \(\gamma\) can be calculated as follows:

$$\begin{aligned} \displaystyle \alpha= & {} \arcsin ({{{ yside}}}/r) \nonumber \\ \displaystyle \beta= & {} \arccos ({{{ zside}}}/d) \nonumber \\ \displaystyle \gamma= & {} \arctan ({ xside}/d) \end{aligned}$$
(11)
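
A minimal Python transcription of this step, following Eqs. 7–11 as written, might look as follows; the function and variable names are illustrative assumptions.

```python
import numpy as np

def camera_orientation(cam_pos, tags):
    """Orient a camera towards the interest point IP (Eqs. 7-11).

    cam_pos: (x_c, y_c, z_c) camera position.
    tags: array of shape (N, 3) with tag coordinates in the scene frame.
    Returns the angles (alpha, beta, gamma) in radians.
    """
    ip = tags.mean(axis=0)                                   # Eq. (7): centroid of the tags
    xside, yside, zside = np.abs(np.asarray(cam_pos) - ip)   # Eq. (8)
    d = np.hypot(xside, yside)                               # Eq. (9): 2D distance
    r = np.hypot(d, zside)                                   # Eq. (10): 3D distance
    alpha = np.arcsin(yside / r)                             # Eq. (11), as given in the text
    beta = np.arccos(zside / d)
    gamma = np.arctan(xside / d)
    return alpha, beta, gamma
```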
Fig. 3 a Search zone of initial camera locations (capture zone) for a motion capture session according to minimum and maximum resolution. b Camera location and orientation according to the interest point IP. c Calculation of camera orientation

3.2.2 Error metric

The goal is to define an optimization function f that measures the ability of a camera network to capture tags in a 3D environment. It is then possible to select, out of all possible camera network configurations, the one that maximizes this function while using a minimum number of cameras.

For each tag, the number of cameras for which the tag is visible is determined in order to estimate the quality Q of the camera locations using the reconstruction error \(E_{{\mathrm{r}}}\). There are two possible situations for \(E_{{\mathrm{r}}}\): the critical case, when no camera or only one camera captures the tag (\(E_{{\mathrm{r}}} = 1\)), and the favourable case, when two or more cameras capture the tag (\(E_{{\mathrm{r}}} = 0\)).

This means that:

$$\begin{aligned}&\forall C_i; \forall {{ tag}}_j{:} \left\{ \begin{array}{ll} E_{{\mathrm{r}}}=1 \Leftrightarrow \left\{ \begin{array}{ll} \forall f_k; \sum \limits _{i, j=1}^{Nc, Nf}P_i {{ tag}}_j=0\\ \qquad \text{ or } \\ \forall f_k; \sum \limits _{i, j=1}^{Nc, Nf}P_i {{ tag}}_j=1 \end{array} \right. \\ E_{{\mathrm{r}}}=0 \Leftrightarrow \forall f_k; \sum \limits _{i, j=1}^{Nc, Nf}P_i {{ tag}}_j \ge 2\end{array} \right. \nonumber \\&k \in [1\ldots Nf], Nf=t_{{{ mov}}}/f_e, P=K.R.t, Nc\ge Nc_{{{ min}}} \end{aligned}$$
(12)

where

  • \({{ Nf}}\) is the number of frames.

  • \(t_{{{ mov}}}\) is the movement duration.

  • \(f_e\) is the acquisition frequency of the movement (60 Hz).

  • \({ P}\) is the projection matrix.

  • \({ K}\) is the intrinsic parameters matrix.

  • \({ R}\) and t are the rotation and translation parameters.

  • \({{ Nc}}_{{{ min}}}\) is the minimum number of cameras to be used, with \({{ Nc}}_{{{ min}}}\ge 2\).

We define a binary vector re indicating the reconstructed tags and a matrix \({ A}\) containing the binary values \({ a}_{ ij}\), with:

$$\begin{aligned} {{ re}}_j&= {\left\{ \begin{array}{ll} 1 &{} {\text {if}} \quad E_{{\mathrm{r}}}=0 \\ 0 &{} {\text {if}} \quad E_{{\mathrm{r}}}=1 \end{array}\right. }\end{aligned}$$
(13)
$$\begin{aligned} a_{ij}&= {\left\{ \begin{array}{ll} 1 & {\text {if\,tag}}\,j\,{\text {is\,captured\,by\,camera}}\,i\\ 0 & {\text {else}} \end{array}\right. } \end{aligned}$$
(14)
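
The following sketch shows how the binary matrix A and the vector re of Eqs. 13 and 14 could be obtained from per-camera visibility tests; the visibility test itself is assumed to be supplied by the constraints described below.

```python
import numpy as np

def reconstruction_vector(visible):
    """Build the binary matrix A (Eq. 14) and vector re (Eq. 13).

    visible: boolean array of shape (Nc, Nt); visible[i, j] is True when
    tag j satisfies the occlusion, field-of-view, and resolution
    constraints for camera i.
    Returns (A, re), where re[j] = 1 if at least two cameras see tag j
    (favourable case, E_r = 0) and 0 otherwise (critical case, E_r = 1).
    """
    a = visible.astype(int)                 # a_ij, Eq. (14)
    re = (a.sum(axis=0) >= 2).astype(int)   # Eq. (13) via E_r
    return a, re
```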

To obtain optimal camera locations, three constraints should be considered: occlusion, camera field of view, and camera resolution.

Occlusion A marker is completely visible if nothing blocks the camera’s view of it. All entities (static, dynamic, or the subject itself) can be considered obstacles \({{ Obs}}_{k}\) (Fig. 4). This constraint is expressed by Eq. 15 (Malik and Bajcsy 2008):

$$\begin{aligned} Occ= {\left\{ \begin{array}{ll} {\text {Visible}}{:} &{} \left( {{ SL}}_{CM}\cap \left( {\mathop {\bigcup }\limits _{k=1}^{n}} {{ Obs}}_{k}\right) \right) ={\varnothing } \\ {\text {Occluded}}{:} &{} {\text {else}} \end{array}\right. } \end{aligned}$$
(15)

where \({{ SL}}_{CM}\) is the straight line connecting the camera centre C and marker M.
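
As an illustration of Eq. 15, the sketch below tests whether the straight line segment from the camera centre C to the marker M intersects any obstacle; approximating each obstacle by a bounding sphere is an assumption made here for brevity.

```python
import numpy as np

def is_visible(cam, marker, obstacles):
    """Occlusion test of Eq. (15) with obstacles approximated by spheres.

    cam, marker: 3D points (camera centre C and marker M).
    obstacles: list of (centre, radius) bounding spheres.
    Returns True if the segment C-M intersects no obstacle.
    """
    cam, marker = np.asarray(cam, float), np.asarray(marker, float)
    seg = marker - cam
    seg_len2 = seg @ seg
    for centre, radius in obstacles:
        # Closest point of the segment C-M to the obstacle centre.
        t = np.clip((np.asarray(centre) - cam) @ seg / seg_len2, 0.0, 1.0)
        closest = cam + t * seg
        if np.linalg.norm(closest - np.asarray(centre)) <= radius:
            return False      # segment blocked: marker occluded
    return True               # SL_CM intersects no Obs_k: marker visible
```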

Fig. 4 Marker visibility constraints

Resolution It represents the minimum and maximum distances from which the camera can accurately view markers (Fig. 4), and is represented by the following formula (Zhao et al. 2009):

$$\begin{aligned} D_{{{ min}}} \le \Vert \left[ (X_{{\mathrm{c}}}, Y_{{\mathrm{c}}}, Z_{{\mathrm{c}}})-(X_m, Y_m, Z_m)\right] \cdot v_i\Vert \le D_{{{ max}}} \end{aligned}$$
(16)

where \(v_i\) is the unit vector along the optical axis of the camera’s view trajectory.

Field of view The camera’s field of view determines whether tags can be observed by the camera (Fig. 4). It is calculated as follows:

$$\begin{aligned} \sigma {\text {(in\, degrees)}} = 2\arctan \left( \frac{h}{2 D_{{{{ min}}}}}\right) \end{aligned}$$
(17)
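
A brief sketch of the resolution and field-of-view checks (Eqs. 16 and 17) is given below; the optical-axis unit vector \(v_i\) and the parameter h are taken here as given inputs, since their computation depends on the camera model.

```python
import numpy as np

def resolution_ok(cam_pos, marker, v_axis, d_min, d_max):
    """Eq. (16): depth of the marker along the optical axis lies in [D_min, D_max]."""
    depth = abs((np.asarray(cam_pos) - np.asarray(marker)) @ np.asarray(v_axis))
    return d_min <= depth <= d_max

def field_of_view_deg(h, d_min):
    """Eq. (17): field-of-view angle sigma in degrees."""
    return np.degrees(2.0 * np.arctan(h / (2.0 * d_min)))
```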

Thus, we seek to maximize the following optimization function:

$$\begin{aligned} f=w_1(\sigma )+w_2(Res)+w_3(Occ) \end{aligned}$$
(18)

where \({\textit{w}}_1\), \({\textit{w}}_2,\) and \({\textit{w}}_3\) are the weights of importance predefined by the designer. This gives:

$$\begin{aligned} { Max} \sum \limits _{i}\, f \,{\text { subject\, to}}\, Q=\frac{E_{{\mathrm{r}}}}{Nf\times Nt} \end{aligned}$$
(19)

For a minimum number of cameras:

$$\begin{aligned} { Min} \sum \limits _{i=1}^{Nc}C_i\,{\text {given}}\sum \limits _{i, j, k=1}^{Nc, Nt, Nf}(a_{ij})_k\ge {{ re}}_{{{ min}}} \end{aligned}$$
(20)

Based on these equations, the objective is to evaluate the quality of the camera network to maximize the visibility of the tags present in the 3D scene.
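
As an illustration of Eq. 18, the per-constraint weighting could be implemented as in the following sketch; the weight values used here are arbitrary placeholders, since the text only states that the weights are predefined by the designer.

```python
def tag_score(in_fov, in_resolution, not_occluded, w1=0.3, w2=0.3, w3=0.4):
    """Weighted score in the spirit of Eq. (18) for a single camera/tag pair.

    in_fov, in_resolution, not_occluded: booleans from the field-of-view,
    resolution, and occlusion checks.
    The weights w1, w2, w3 are design choices (arbitrary values summing to 1 here).
    """
    return w1 * in_fov + w2 * in_resolution + w3 * not_occluded
```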

3.2.3 Adaptive genetic algorithm

Our adaptive genetic algorithm proceeds in four steps: initialization, evaluation, selection, and reproduction as follows:

Initialization A population of camera configurations is created as an initial solution using the DET block. An individual is a vector of real numbers containing the translation and orientation parameters of each camera. We use 4–10 cameras, which results in vectors of 24 up to 60 dimensions. The population consists of 300 such individuals (Fig. 5).
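
A sketch of this initialization, reusing the camera_orientation function from the earlier DET sketch, is given below; the bounds of the DET capture zone are assumed inputs.

```python
import numpy as np

def init_population(n_cameras, pop_size, zone_min, zone_max, tags, rng):
    """Create an initial population of camera networks from the DET zone.

    Each individual is a flat vector of 6 * n_cameras real numbers:
    (x_c, y_c, z_c, alpha, beta, gamma) per camera. Positions are drawn
    inside the DET capture zone and orientations point at the interest point.
    """
    population = []
    for _ in range(pop_size):
        individual = []
        for _ in range(n_cameras):
            pos = rng.uniform(zone_min, zone_max)                 # position in DET zone
            individual.extend(pos)
            individual.extend(camera_orientation(pos, tags))      # Eqs. (7)-(11)
        population.append(np.array(individual))
    return population

# Example: 300 individuals of 4 cameras each (24-dimensional vectors).
# population = init_population(4, 300, zone_min, zone_max, tags, np.random.default_rng(0))
```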

Fig. 5 Population of a camera network (initial solution). Each individual contains n cameras and the population consists of m such individuals

Evaluation Each individual in the population is evaluated according to the error metric described in Sect. 3.2.2 and assigned a corresponding fitness score. Our error metric is the 3D point reconstruction error applied to the distribution of tags. In our algorithm, the reconstruction error yields a single score per individual, which allows its fitness to be evaluated. The evaluation is stopped after a predetermined number of iterations (e.g. 60 iterations) or when the fitness score reaches a given threshold (e.g. \(f=0.9\)).

Selection The population is sorted according to the fitness of the evaluated camera networks. Elitism (Goldberg and Holland 1988) is used as the selection method for camera networks exhibiting a fitness score \({\ge }0.5\) (Fig. 6).
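
A minimal sketch of this elitist selection, assuming fitness scores have already been computed for each individual, is shown below.

```python
def select_elite(population, fitness, threshold=0.5):
    """Elitism: keep camera networks whose fitness score is >= threshold,
    sorted from best to worst."""
    ranked = sorted(zip(fitness, population), key=lambda pair: pair[0], reverse=True)
    return [individual for score, individual in ranked if score >= threshold]
```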

Fig. 6 Selection of solutions where at least half of the markers are seen (fitness \({\ge }0.5\))

Reproduction The camera networks selected in the previous step are used to create a new generation of camera networks based on crossover and mutation operators (Fig. 7). The choice of crossover probability \(P_{{\mathrm{c}}}\) and mutation probability \(P_m\) has a direct impact on the convergence of the algorithm (Mitchell 1998; Szeliski 2010). Moreover, it is difficult to find the best values of \(P_{{\mathrm{c}}}\) and \(P_m\) that adapt to all problems. Because of this, we adopted adaptive crossover and mutation strategies.

Commonly used adaptive operators are detailed in Angeline and Angeline (1995), Kita (2001) and Xu et al. (2003). We used the crossover and mutation operators from Kita (2001) to reduce the population size and converge rapidly towards the optimal solution. These operators have been shown to be efficient (Aissaoui et al. 2014).

The crossover probability is expressed:

$$\begin{aligned} P_{{\mathrm{c}}} = \left\{ \begin{array}{ll} k_1\left( f_{{{ max}}}-f\right) /\left( f_{{{ max}}}-\bar{f}\right) , &{} f \ge \bar{f} \\ k_2, &{} f <\bar{f} \end{array} \right. \end{aligned}$$
(21)

The mutation probability is expressed:

$$\begin{aligned} P_{m} = \left\{ \begin{array}{ll} k_3\left( f_{{{ max}}}-f\right) /\left( f_{{{ max}}}-\bar{f}\right) , &{} f \ge \bar{f} \\ k_4, &{}f <\bar{f} \end{array} \right. \end{aligned}$$
(22)

where

  • \(k_1\), \(k_2\), \(k_3\), \(k_4\): real numbers \({\le } 1.0\).

  • \(f_{{{ max}}}\): maximum fitness value of population.

  • \(\bar{f}\): average fitness value for each generation.

  • f: fitness value of the individual undergoing crossover or mutation.

It is well established in the genetic algorithm literature (Srinivas and Patnaik 1994; Bao et al. 2011) that \((P_{{\mathrm{c}}} \ge 0.5)\) and \((P_m \ge 0.01)\) produce good algorithm convergence. We therefore fixed the values of \(k_1\) and \(k_2\) at 0.5 and those of \(k_3\) and \(k_4\) at 0.01.
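
Under these settings, Eqs. 21 and 22 can be transcribed as in the following sketch; the guard against a zero denominator is an added assumption.

```python
def adaptive_rates(f, f_max, f_avg, k1=0.5, k2=0.5, k3=0.01, k4=0.01):
    """Adaptive crossover and mutation probabilities (Eqs. 21 and 22).

    f: fitness of the individual to be recombined/mutated,
    f_max: maximum fitness in the population, f_avg: average fitness.
    Returns the pair (P_c, P_m).
    """
    if f >= f_avg:
        scale = (f_max - f) / (f_max - f_avg) if f_max > f_avg else 0.0
        return k1 * scale, k3 * scale
    return k2, k4
```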

Fig. 7 Reproduction operators

Finally, the solution space represents an n-dimensional polyhedron that changes in size and shape whenever a new generation is created. The crossover and mutation operators act by visiting corner points of this polyhedron. Thus, by visiting these corner points, an optimal placement solution is proposed when the objective function is maximized.

4 Simulation and performance evaluation

4.1 Simulation

In order to compare the performance of our proposed GGA with that of GA, simulations were conducted on MATLAB 8.5 (MathWorks 1994) using three scenarios: walking, running, and jumping. The necessary simulation parameters are shown in Table 1.

Table 1 Simulation parameters

4.2 Performance evaluation

In order to evaluate the performance of the GGA from the simulation results, we considered the following three performance measures: optimization quality, optimization time, and optimization cost.

4.2.1 Optimization quality

This measure is related to the fitness value of each generated camera network. The fitness value defines convergence to the optimal camera network. The closer the value is to “1”, the more efficient the camera network.

Figure 8 shows the progression of the average fitness value over 60 iterations for both the GGA and the GA. For the three scenarios, the GA started with a low fitness value, whereas the GGA started with a significantly higher one in the initial iterations, and the fitness performance is clearly better for the GGA throughout.

Fig. 8 Optimization quality of GGA and GA with respect to average fitness value for evaluated camera networks

4.2.2 Optimization time

This represents the maximum time allowed to provide the best camera network for a given scenario. If the algorithm does not provide a response within this limit for a given test, it is considered a failure, even though the algorithm could provide a fair answer with more time.

As shown in Fig. 9, the GGA has a significant advantage over the GA in terms of optimization time. The GGA successfully selects the most suitable camera locations for the studied scenario to recover tag coordinates in a reasonable time. In contrast, the optimization time of the GA varies depending on the movement studied: as the scenario becomes more complex, the optimization time clearly increases.

Fig. 9 Optimization time of camera locations according to the number of reconstructed tags over the number of iterations for GGA and GA

4.2.3 Optimization cost

This measure is essential for motion capture applications. The goal is to minimize the number of cameras used for motion capture, which in turn minimizes the necessary cost for the motion capture setup.

Concerning the optimization cost, Fig. 10 demonstrates that GGA reaches a maximum rate of recovered tags (100%) by using only 4 or 6 cameras with a runtime of 15 min. However, GA uses 10 cameras to reach 100% with a runtime of 60 min.

Fig. 10 Rate of reconstructed tags with 4, 6, and 10 cameras, which reflects the optimization cost for both GGA and GA

Figure 11 shows the best camera network configurations obtained with the GGA and the GA for the three scenarios. Regardless of the complexity of the scenario, the best configuration with the GGA is obtained using fewer cameras than with the GA.

For the walking, running, and jumping scenarios, the GGA finds the best configuration using 4, 4, and 6 cameras, respectively, whereas the best GA configuration requires 8, 10, and 10 cameras, respectively. The GGA is therefore more efficient than the GA at minimizing the number of cameras used and is consequently less costly.

Fig. 11 Best camera network configurations obtained with GGA and GA algorithms for the 3 scenarios

5 Conclusion

This paper presented a method to optimize camera network configuration for motion capture. The proposed method improves the performance of classical genetic algorithms, addressing the problem of divergence of solutions and reducing the computation time of the optimization. To achieve this goal, an error metric that evaluates the quality of capture and a distribution and estimation technique that restricts the search space were developed.

The work proposed in this paper opens new challenges for improving the robustness of the GGA. We will explore the possibility of combining the GGA with other evolutionary techniques to speed up convergence and better satisfy the requirements of global optimization of camera network placement for motion capture.

Finally, this work opens new perspectives for the acquisition of human movement in complex environments (e.g. with obstacles). We plan to develop a digital application, based on this work, to help position the cameras in an experimental scene, which will save time and also limit the loss of markers affixed to the subject being tested.