1 Introduction

In recent years, hand detection, gesture recognition, and tracking applied to perceptive user interfaces and virtual reality have gained growing attention [7, 23, 25]. Although hand posture recognition (static hand gestures) has been reported to be very successful in HCI (Human-Computer Interaction), dynamic gesture recognition and tracking with a single camera still face many obstacles for the following reasons [24]. First, modeling hand gestures is difficult owing to the deformable, articulated structure of hands and the self-occlusion problem. Second, cluttered environments and varying lighting conditions lead to inaccurate hand segmentation. Third, hand appearance varies among people of different races, ages, weights, and so on. To address these problems, several hand gesture recognition and tracking methods have been proposed; they can be categorized into the following classes.

First, hand gestures are modeled and recognized using geometrical characteristics of the segmented hand region [4, 5, 18, 37]. In this approach, once the segmented image is obtained, the convex hull and convexity defects are calculated to find the hand area and finger positions, after which the vertices of the convex hull are taken as the fingertips. This causes mistakes when vertices correspond to other corners of the hand silhouette rather than fingertips, leading to errors in gesture recognition and tracking. Second, depth images are used for gesture recognition and hand tracking by some researchers [12, 28, 31]. The advantage of 3D cameras lies in better discrimination between gestures and background, while the disadvantage lies in relatively low resolution. In addition, at equivalent resolution, 3D cameras are much more expensive than ordinary cameras. Third, deep learning has drawn much attention in gesture recognition and hand tracking [8, 19, 29]. Methods based on deep learning perform excellently on particular gestures after long training, whereas the time they consume in real-time applications is unsatisfactory. Fourth, other model-based algorithms have been used in the literature. In [24], the authors proposed centroid tracking of hand gestures that captured and retained time-sequence information for feature extraction. Zhang [36] used a multi-cue hand tracking method integrating motion and color cues from a feature-point-selection view.

In human-computer interaction, one problem has long existed yet received hardly any attention: an interaction system using a static camera stops working once the hand leaves the camera’s field of view. A PTZ camera is a desirable option to solve this problem. Nevertheless, none of the studies above combined a PTZ camera with a dynamic hand tracking algorithm to achieve servo tracking.

Distinguished from the existing approaches, a real-time dynamic gesture recognition and hand tracking method using a PTZ camera was proposed in this study. The aim was a robust scheme that stably recognizes simple hand gestures and tracks the hand with a PTZ camera so that the fingertip remains in the center of the camera’s field of view.

Specifically, the hand region in a cluttered environment was segmented using skin color segmentation in the YCbCr color space, from which the silhouette of the hand was obtained. For every point on the silhouette, a sub contour point set was constructed, and the Monte Carlo Sampling method was used to obtain the best-fitting cubic Bezier curve for the sub contour set. The curvature of the continuous cubic Bezier curve at the corresponding point was then taken as the estimated curvature of the point on the silhouette. Given the estimated curvatures, the local maxima of a cumulative curvature curve were detected as candidate fingertips. After that, geometry feature analysis, including a convexity approach and feature triangle analysis, was used to locate the final fingertip and recognize several specific simple gestures. Finally, if a tracking gesture (right click down) was recognized, the system drove the PTZ camera to track the fingertip and keep it in the middle of the camera’s field of view in real-time. An overview of the proposed method is shown in Fig. 1.

Fig. 1 Overview of the proposed method

The innovative points of the proposed method included:

  • A new fingertip detection method was proposed based on Monte Carlo sampling Cubic Bezier curve fitting.

  • A feature triangle analysis was used to realize simple dynamic gesture recognition.

  • The gesture recognition was combined with a PTZ camera to achieve servo tracking in real-time.

This paper is organized as follows. Section 2 details the proposed gesture recognition method including hand region detection and extraction, fingertip detection based on Monte Carlo Sampling cubic Bezier curve fitting, and simple gesture modeling using feature triangle analysis. Furthermore, Section 3 introduces the experimental results and analyses. Last but not least, Section 4 outlines the conclusions and future research.

2 The proposed gesture recognition and tracking method

2.1 Hand region detection and extraction

The first step of finger detection was detecting the hand region, which played an essential role in the whole process, since a complete and accurate hand region laid the foundation for further analysis. In the proposed method, this was achieved through skin color segmentation in the YCbCr color space, which separates luminance (Y) from chrominance (Cb, Cr), yields a compact skin cluster, and is widely used in video compression standards. An image in RGB color space was transformed to YCbCr using the following equation:

$$ \left[\begin{array}{c}Y\\ {} Cb\\ {} Cr\end{array}\right]=\left[\begin{array}{c}16\\ {}128\\ {}128\end{array}\right]+\left[\begin{array}{ccc}65.481& 128.553& 24.966\\ {}-37.797& -74.203& 112\\ {}112& -93.786& -18.214\end{array}\right]\left[\begin{array}{c}R\\ {}G\\ {}B\end{array}\right] $$
(1)

In this paper, the extraction of the hand region from the background was accomplished using a threshold technique proposed by Chai and Ngan [6]. However, the threshold ranges of Cb and Cr in our method, RCb = [77, 127] and RCr = [133, 155], were narrower than those in [6], in order to minimize the effect of background noise.

After skin detection and extraction, a morphological filtering process consisting of an erosion operation and a dilation operation was conducted to obtain more complete candidate regions. Subsequently, the largest region was chosen as the desired hand region (see Fig. 2).
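The segmentation step above can be sketched as follows. This is a minimal per-pixel illustration (not the paper's implementation), assuming RGB values normalized to [0, 1] and using the conversion of Eq. (1) together with the narrowed thresholds RCb = [77, 127], RCr = [133, 155]; the function names are illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 conversion of Eq. (1); r, g, b assumed in [0, 1]."""
    y  = 16.0  +  65.481 * r + 128.553 * g +  24.966 * b
    cb = 128.0 + (-37.797 * r -  74.203 * g + 112.0   * b)
    cr = 128.0 + (112.0   * r -  93.786 * g -  18.214 * b)
    return y, cb, cr

def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 155)):
    """Threshold test used for hand-region extraction."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return (cb_range[0] <= cb <= cb_range[1] and
            cr_range[0] <= cr <= cr_range[1])

def segment(image):
    """Binary skin mask for an image given as rows of (r, g, b) triples."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in image]
```

In practice the resulting binary mask would then be cleaned by the erosion/dilation step and the largest connected component kept as the hand region.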

Fig. 2 Hand region segmentation. Left: input image; Middle: skin color segmentation result; Right: hand contour

2.2 Fingertip detection based on Monte Carlo sampling Bezier curve fitting

As the first step of gesture recognition, fingertip detection plays a crucial role; previously proposed methods can be categorized as follows.

First, fingertip detection was handled by correlation techniques using a circular mask, in which the fingertip was modeled as a cylinder with a hemispherical cap [3, 13, 14, 21, 22, 30]. Normalized correlation was then used with a template of a properly sized circle corresponding to the user’s fingertip size. Under this model, the hand segmentation must be very accurate to achieve good detection and to fulfill the requirement of the approximately cylindrical shape. Second, local maxima of the curvature along the boundary of the silhouette were used as features to detect fingertips [2, 15, 20, 26]. In all these curvature-based methods, the curvature was calculated from the discrete points on the silhouette, which led to high sensitivity to noise such as local valleys caused by incomplete hand region segmentation. Third, the convexity approach was used for finger detection and hand gesture recognition in some recent studies, in which the convex hull and convexity defects were calculated to find the hand area and finger positions as soon as the segmented image was obtained [4, 10]. The vertices of the convex hull were then considered to be the fingertips, which caused mistakes when the vertices represented other corners of the hand silhouette rather than fingertips. Fourth, an algorithm based on the distance of the contour points to the hand position was also used in some studies [11, 17]. In this algorithm, points far from the center of the hand were assumed to be fingertips. However, a hand is deformable, and its center may be far from the expected position if the gesture is not regular, which leads to errors in fingertip location.

In contrast to the existing approaches, an innovative method using cubic Bezier curve fitting based on Monte Carlo Sampling was proposed in this study for fingertip detection, by which the robustness to noise and the accuracy of fingertip location were significantly improved.

2.2.1 Bezier curve presenting

After obtaining the contour of the hand, it was essential to calculate the curvature of every point on the contour and locate the local maxima.

The contour point set is {CPm}  1 ≤ m ≤ M, where M is the total number of points on the contour, and a sub contour point set of CPm including 2L + 1 points is defined as follows:

$$ \left\{C{P}_{m+j}^s\right\}\kern0.48em 1\le m\le M\kern0.36em -L\le j\le L $$
(2)

The points are centered at contour point CPm and consist of the L contour points before CPm and the L contour points after CPm (see Fig. 3). A curve was fitted to the sub contour point set, and the curvature of the fitted curve at its middle point was taken as the estimated curvature of the contour at CPm. The problem was thus transformed into finding a robust and efficient curve fitting method for the sub contour point set \( C{P}_{m+j}^s \). To this end, a Bezier curve Monte Carlo Sampling and fitting method with high fitting accuracy and computational efficiency was proposed.

Fig. 3 Sub contour centered at CPm

Bezier curves are widely used in computer graphics for smooth curve modeling. The curve is completely contained in the convex hull of its control points, so the points can be displayed graphically and used to manipulate the curve intuitively. Affine transformations such as translation and rotation can be applied to the curve by applying the corresponding transformation to its control points.

An nth-degree Bezier curve is represented by the following equation:

$$ P(t)=\sum \limits_{i=0}^n{P}_i{B}_{i,n}(t)\kern0.36em t\in \left[0,1\right] $$
(3)

where Pi refers to the control points, n refers to the degree, and t refers to the parameter. The basis function, Bi, n(t), is the classical nth-degree Bernstein polynomials defined explicitly by:

$$ {B}_{i,n}(t)={C}_n^i{t}^i{\left(1-t\right)}^{n-i}=\frac{n!}{i!\left(n-i\right)!}{t}^i{\left(1-t\right)}^{n-i} $$
(4)

Considering accuracy and computational efficiency, degree three (n = 3), i.e., the cubic Bezier (see Fig. 4), was adopted in the current study. Namely, four points are required to define a cubic Bezier: two off the curve, called the control points, and two on the ends of the curve, called the end points. Given the sub contour point set \( C{P}_{m+j}^s \), the curve fitting problem was to find four optimal control points defining the fitting curve. An easy way was to use an exhaustive search through the possible space, yet this brute-force method was time-consuming. In this paper, the Monte Carlo method was proposed to find the optimal control points quickly.
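Eqs. (3)-(4) for the cubic case can be sketched directly. The following is a small illustration (function names are our own) of evaluating a Bezier curve from its control points via the Bernstein basis:

```python
from math import comb

def bernstein(i, n, t):
    """B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i), as in Eq. (4)."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)

def bezier_point(ctrl, t):
    """P(t) = sum_i P_i B_{i,n}(t), Eq. (3); ctrl is a list of (x, y) points."""
    n = len(ctrl) - 1
    x = sum(px * bernstein(i, n, t) for i, (px, _) in enumerate(ctrl))
    y = sum(py * bernstein(i, n, t) for i, (_, py) in enumerate(ctrl))
    return x, y
```

Note that the end points are interpolated: P(0) equals the first control point and P(1) the last, which is why P0 and P3 lie on the curve while P1 and P2 only shape it.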

Fig. 4 Cubic Bezier Curve

2.2.2 Monte Carlo sampling

The Monte Carlo method provides approximate solutions to various mathematical problems by performing statistical sampling experiments on a computer. It applies to problems with absolutely no probabilistic content as well as to those with inherent probabilistic structure. The Cubic Bezier curve fitting problem in this work is formulated [21] as:

$$ \underset{x\in X}{\min}\left\{f(x):= E\left[F\left(x,\xi \right)\right]\right\} $$
(5)

where F(x, ξ) is a function of two vector variables x ∈ ℝⁿ and ξ ∈ ℝᵈ, and X ⊂ ℝⁿ is a given set; in this work, it is the sub contour point set \( C{P}_{m+j}^s \). ξ = ξ(ω) is a random vector representing a candidate sub contour \( P{\left(\omega, t\right)}_m^s \) defined by a randomly generated control point vector ω. The expectation in Eq. (5) is taken with respect to the probability distribution of ξ, which is assumed to be known. The function F(x, ξ) is formulated as:

$$ F\left(x,\xi \right)=\sum \limits_{j=-L}^{L}{\left|P{\left(\omega, {t}_j\right)}_m^s-C{P}_{m+j}^s\right|}^2 $$
(6)

where the t_j are parameter values uniformly spaced in [0, 1] corresponding to the points of the sub contour.

The objective is to find the Bezier curve that best fits the given sub contour point set in the least squares sense. Since ξ has a finite number of possible scenarios with positive probabilities pk, k = 1, ..., K, the expected value is written as:

$$ f(x)=\sum \limits_{k=1}^K{p}_kF\left(x,{\xi}_k\right) $$
(7)

However, an exhaustive discretization of the probability distribution leads to an exponential growth in the number of scenarios. Therefore, the Monte Carlo Sampling technique was used to reduce the computational complexity. The proposed approach generates a sequence of random sub Bezier curve contours ξ1, ξ2, ... in the sample space, which are independent of each other and uniformly distributed (i.e., independent and identically distributed). With the N generated random samples ξ1, ξ2, ..., ξN, the sample average function is written as:

$$ {\hat{f}}_N(x):= \frac{1}{N}\sum \limits_{i=1}^NF\left(x,{\xi}^i\right) $$
(8)

Indeed, \( {\hat{f}}_N(x) \) is an unbiased estimator of f(x). Moreover, according to the law of large numbers, \( {\hat{f}}_N(x) \) is a consistent estimator of f(x) as N → ∞. Subsequently, the cubic Bezier curve fitting problem in Eq. (5) is approximated as:

$$ \underset{x\in X}{\min}\left\{{\hat{f}}_N(x):= \frac{1}{N}\sum \limits_{i=1}^NF\left(x,{\xi}^i\right)\right\} $$
(9)

Specifically, owing to the properties of the cubic Bezier curve, the start control point P0 and the end control point P3 lie on the curve, while P1 and P2 lie off it. Given the sub contour point set \( C{P}_{m+j}^s \), the four control points of each candidate sub contour are determined by a randomly generated control point vector ωζ, ζ = 1, 2, ..., N, where N is the total number of random samples. Consequently, a candidate sub contour using a cubic Bezier curve is expressed as:

$$ {P}_{\zeta }{\left({\omega}^{\zeta },t\right)}_m^s=\sum \limits_{i=0}^3{P}_i{\left({\omega}^{\zeta}\right)}_m{B}_{i,3}(t)=\left[{t}^3\kern0.5em {t}^2\kern0.5em t\kern0.5em 1\right]\;\left[\begin{array}{cccc}-1& 3& -3& 1\\ {}3& -6& 3& 0\\ {}-3& 3& 0& 0\\ {}1& 0& 0& 0\end{array}\right]\;\left[\begin{array}{l}{\omega}_{0m}^{\zeta}\\ {}{\omega}_{1m}^{\zeta}\\ {}{\omega}_{2m}^{\zeta}\\ {}{\omega}_{3m}^{\zeta}\end{array}\right] $$
(10)

The control points \( {\omega}_{0m}^{\zeta } \), \( {\omega}_{1m}^{\zeta } \), \( {\omega}_{2m}^{\zeta } \) and \( {\omega}_{3m}^{\zeta } \) are uniform random variables drawn independently from the following ranges:

$$ {\displaystyle \begin{array}{l}{\omega}_{0m}^{\zeta}\in \left[C{P}_{m-L}^s-{T}_0,C{P}_{m-L}^s+{T}_0\right]\\ {}{\omega}_{1m}^{\zeta}\in \left[{T}_{\mathrm{min}}-{T}_0,{T}_{\mathrm{max}}+{T}_0\right]\\ {}{\omega}_{2m}^{\zeta}\in \left[{T}_{\mathrm{min}}-{T}_0,{T}_{\mathrm{max}}+{T}_0\right]\\ {}{\omega}_{3m}^{\zeta}\in \left[C{P}_{m+L}^s-{T}_0,C{P}_{m+L}^s+{T}_0\right]\end{array}} $$
(11)

where T0 denotes a constant threshold which defines a tolerance interval for the corresponding range, and Tmin, Tmax are defined as:

$$ {\displaystyle \begin{array}{l}{T}_{\mathrm{min}}=\min \left\{C{P}_{m+j}^s\right\}\kern0.36em -L\le j\le L\\ {}{T}_{\mathrm{max}}=\max \left\{C{P}_{m+j}^s\right\}\kern0.36em -L\le j\le L\end{array}} $$
(12)

Equation (11) means that the start and end control points of a candidate cubic Bezier curve are supposed to lie within areas centered at the corresponding sub contour points \( C{P}_{m-L}^s \) and \( C{P}_{m+L}^s \), while the other two control points have an equal chance of appearing at any position of the whole sample space defined by \( C{P}_{m+j}^s \) (see Fig. 5). The start and end control points are not placed exactly at \( C{P}_{m-L}^s \) and \( C{P}_{m+L}^s \), since the hand contour extraction is probably not perfectly accurate and the positions of the corresponding contour points may contain some uncertainty, which is modeled by sampling the points within areas centered at \( C{P}_{m-L}^s \) and \( C{P}_{m+L}^s \).
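The sampling-and-selection loop of Eqs. (6)-(11) can be sketched as follows. This is a hedged illustration, not the paper's implementation: P0 and P3 are drawn near the sub-contour end points, P1 and P2 anywhere in the padded bounding box, and the candidate with the smallest least-squares residual is kept. The function names, the default T0 value, and the uniform parameterization of the t values are our assumptions.

```python
import random

def cubic_bezier(p0, p1, p2, p3, t):
    # de Casteljau-equivalent closed form of a cubic Bezier point
    s = 1.0 - t
    x = s**3*p0[0] + 3*s*s*t*p1[0] + 3*s*t*t*p2[0] + t**3*p3[0]
    y = s**3*p0[1] + 3*s*s*t*p1[1] + 3*s*t*t*p2[1] + t**3*p3[1]
    return x, y

def fit_sub_contour(pts, n_samples=20, t0=2.0, rng=None):
    """Return (best control points, residual) for a sub contour point set."""
    rng = rng or random.Random(0)
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    lo = (min(xs) - t0, min(ys) - t0)          # padded bounding box, Eq. (12)
    hi = (max(xs) + t0, max(ys) + t0)
    ts = [j / (len(pts) - 1) for j in range(len(pts))]  # parameter per point

    def near(p):       # P0, P3: within +/- T0 of an end point (Eq. 11)
        return (rng.uniform(p[0]-t0, p[0]+t0), rng.uniform(p[1]-t0, p[1]+t0))

    def anywhere():    # P1, P2: uniform over the padded bounding box
        return (rng.uniform(lo[0], hi[0]), rng.uniform(lo[1], hi[1]))

    best, best_err = None, float("inf")
    for _ in range(n_samples):
        cand = (near(pts[0]), anywhere(), anywhere(), near(pts[-1]))
        err = sum((cx - px)**2 + (cy - py)**2          # residual, Eq. (6)
                  for t, (px, py) in zip(ts, pts)
                  for cx, cy in [cubic_bezier(*cand, t)])
        if err < best_err:
            best, best_err = cand, err
    return best, best_err
```

As expected from Eq. (9), increasing the sample count N can only tighten the best residual found.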

Fig. 5 Distribution of control points. Yellow: possible positions of control point P0; Brown: possible positions of control point P3; Green: possible positions of control points P1 and P2

2.2.3 Cubic Bezier curvature and local maximum extraction

Using the above-described Monte Carlo Sampling method (i.e., sampling randomly in the sample space N times), the set of control points that best fits the given sub contour \( C{P}_{m+j}^s \) is obtained as \( \left\{{P}_i^{best},i=0,1,2,3\right\} \). The curvature at any point of the Bezier curve is then given by the following equation:

$$ K(t)=\frac{\mid P\hbox{'}(t)\times P\hbox{'}\hbox{'}(t)\mid }{{\left|P\hbox{'}(t)\right|}^3} $$
(13)

where the first derivative P ' (t) and the second derivative P '  ' (t) of the Bezier curve are represented as follows:

$$ P\hbox{'}(t)=\sum \limits_{i=0}^n{P}_i^{best}{B}_{i,n}\hbox{'}(t)=n\sum \limits_{i=1}^n\left({P}_i^{best}-{P}_{i-1}^{best}\right){B}_{i-1,n-1}(t) $$
(14)
$$ P\hbox{'}\hbox{'}(t)=n\left(n-1\right)\sum \limits_{i=0}^{n-2}\left({P}_{i+2}^{best}-2{P}_{i+1}^{best}+{P}_i^{best}\right){B}_{i,n-2}(t) $$
(15)

Specifically, the first and the second derivative of Cubic Bezier curve are written as:

$$ P\hbox{'}(t)=3\left[\left({P}_1^{best}-{P}_0^{best}\right){\left(1-t\right)}^2+2\left({P}_2^{best}-{P}_1^{best}\right)t\left(1-t\right)+\left({P}_3^{best}-{P}_2^{best}\right){t}^2\right] $$
(16)
$$ P\hbox{'}\hbox{'}(t)=6\left[\left(-{P}_0^{best}+3{P}_1^{best}-3{P}_2^{best}+{P}_3^{best}\right)t+\left({P}_0^{best}-2{P}_1^{best}+{P}_2^{best}\right)\right] $$
(17)

As mentioned above, the given sub contour set is centered at contour point CPm and consists of the L contour points before CPm and the L contour points after CPm. The curvature at CPm can therefore be taken as the curvature at the middle point of the best-fitting Bezier curve defined by \( \left\{{P}_i^{best},i=0,1,2,3\right\} \); specifically, K(t = 0.5) is used as the required curvature.
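The midpoint-curvature computation of Eqs. (13), (16), and (17) can be written out directly. A minimal sketch (the function name is our own; the middle derivative term carries the factor 2 of the standard cubic Bezier derivative):

```python
def bezier_curvature_mid(p0, p1, p2, p3, t=0.5):
    """Curvature of a cubic Bezier at parameter t (default: midpoint)."""
    s = 1.0 - t
    # first derivative, Eq. (16)
    dx = 3 * ((p1[0]-p0[0])*s*s + 2*(p2[0]-p1[0])*t*s + (p3[0]-p2[0])*t*t)
    dy = 3 * ((p1[1]-p0[1])*s*s + 2*(p2[1]-p1[1])*t*s + (p3[1]-p2[1])*t*t)
    # second derivative, Eq. (17)
    ddx = 6 * ((-p0[0] + 3*p1[0] - 3*p2[0] + p3[0])*t + (p0[0] - 2*p1[0] + p2[0]))
    ddy = 6 * ((-p0[1] + 3*p1[1] - 3*p2[1] + p3[1])*t + (p0[1] - 2*p1[1] + p2[1]))
    # Eq. (13): |P' x P''| / |P'|^3 (2D cross product magnitude)
    cross = abs(dx * ddy - dy * ddx)
    speed = (dx * dx + dy * dy) ** 0.5
    return cross / speed**3
```

A quick sanity check: collinear control points give zero curvature, and the standard cubic approximation of a unit quarter circle (inner control offset k ≈ 0.5519) gives a midpoint curvature close to 1, the reciprocal of the radius.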

Once the curvatures of all contour points have been calculated, the local maxima, which have a relatively high probability of being fingertips, are extracted. However, the curvatures obtained by the Monte Carlo Sampling method still contain noise that makes the extraction of local maxima difficult (see the left image of Fig. 6). To address this problem, a smoothing process is introduced to construct a new curvature Kc (also called the cumulative curvature), described as follows:

$$ {K}_i^c=\sum \limits_{j=i}^{i+2L}{K}_j $$
(18)
Fig. 6 The curvatures at each contour point of the input image shown in Fig. 2. Left: curvature curve; Right: cumulative curvature curve

The right image of Fig. 6 shows the cumulative curvature of the left one. The cumulative curvature curve is clearly much smoother than the original curvature curve, which makes locating the local maxima easier and more accurate. The result of locating the maxima is shown in Fig. 7.
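Eq. (18) and the peak search can be sketched as a sliding-window sum followed by a strict local-maximum test. This is an illustration under our own assumptions (the contour is treated as circular, and the strict-inequality peak test is a simplification of whatever detector the authors used):

```python
def cumulative_curvature(k, L):
    """Kc_i = sum of k[i .. i+2L] on a circular contour, Eq. (18)."""
    n = len(k)
    return [sum(k[(i + j) % n] for j in range(2 * L + 1)) for i in range(n)]

def local_maxima(kc):
    """Indices that are strictly greater than both circular neighbors."""
    n = len(kc)
    return [i for i in range(n)
            if kc[i] > kc[(i - 1) % n] and kc[i] > kc[(i + 1) % n]]
```

On a noisy curvature signal the windowed sum suppresses isolated spikes, so the surviving maxima correspond to genuinely high-curvature stretches of the contour.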

Fig. 7 The detection result of local maxima in the cumulative curvature

2.3 Geometry feature analysis

Curvature detection alone was not enough to locate the desired fingertips. As shown in Fig. 7, the valley points between two fingers were also local maxima. Consequently, a geometry feature analysis based on convex hull analysis and convexity defect detection was proposed to overcome this difficulty.

Given a set containing all contour points, the convex hull of the set was detected by the Graham scan method [1] (see Fig. 8).

Fig. 8 Detection result of convex hull
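The paper uses the Graham scan [1]; for illustration, a compact alternative that produces the same hull is Andrew's monotone chain, sketched here (this substitution is ours, not the authors'):

```python
def convex_hull(points):
    """Convex hull of 2D points, returned in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])

    lower, upper = [], []
    for p in pts:                       # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # each list's last point is the other list's first; drop duplicates
    return lower[:-1] + upper[:-1]
```

Both algorithms run in O(n log n), dominated by the sort; interior points such as finger-valley candidates never appear among the hull vertices.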

After the detection of convex hull, the valley points between fingers were located by defect detection between two convex hull vertices (see Fig. 9):

Fig. 9 Detection result of convexity defects

The results of curvature detection in Fig. 7 and convexity defect detection in Fig. 9 were combined: points that possessed high curvature and did not belong to the valleys between fingers were finally considered to be fingertips (see Fig. 10). In the figure, locations marked with green circles alone are fingertips, while those with both green and white circles are points that possess high curvature but belong to valleys between fingers.
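The combination step can be sketched as simple index filtering: a curvature candidate survives only if it is not near any convexity-defect (valley) point. The `min_gap` tolerance (in contour-index units) is an illustrative parameter of ours, not a value from the paper:

```python
def select_fingertips(candidates, valleys, n_contour, min_gap=5):
    """Keep curvature maxima that are far from all valley points.

    candidates, valleys: contour indices; n_contour: contour length.
    """
    def circ_dist(a, b):
        # shortest distance between two indices on a closed contour
        d = abs(a - b) % n_contour
        return min(d, n_contour - d)

    return [c for c in candidates
            if all(circ_dist(c, v) > min_gap for v in valleys)]
```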

Fig. 10 Fingertip detection result

2.4 Feature triangle analysis

Every two fingertip vertices and the valley vertex between them form a triangle (see the red triangle in Fig. 11), called the feature triangle. Using the feature triangle for gesture recognition and hand tracking has several advantages, including scale invariance, rotation invariance, convenience for dynamic analysis, and ease of gesture modeling.

Fig. 11 Feature triangle

The gestures “right click down” (left of Fig. 12) and “right click up” (right of Fig. 12) were modeled easily using feature triangle analysis. After analyzing a video (556 frames) including several occurrences of the gestures “right click down” and “right click up”, the three interior angles and the area of the feature triangle were recorded, as seen in Fig. 13.

Fig. 12 Gesture “right click down” (left) and gesture “right click up” (right)

Fig. 13 The three interior angles (left) and the area of the feature triangle (right)

Figure 13 shows that the interior angles and the area of the feature triangle changed significantly when the gesture switched between “right click up” and “right click down”. By performing appropriate statistical analysis on the area and interior angles of the target feature triangle over n consecutive frames, the switching between “right click up” and “right click down” was judged and recognized accurately in real-time. Suppose {TSi, i = t − Nh, t − Nh + 1, ⋯, t} represents the triangle areas of the frames from t − Nh to t, and {Tαi}, {Tβi}, {Tγi} represent the three interior angles. Fig. 13 shows that the angle at vertex 1 (Tαi) exhibited the most obvious change, so it was the only angle considered in our model. Let \( {u}_t^{s1} \), \( {u}_t^{s2} \), \( {u}_t^{\alpha 1} \), \( {u}_t^{\alpha 2} \) represent the mean area and mean angle of the first mh frames and the last mh frames of the window, and \( {\operatorname{var}}_t^{s1} \), \( {\operatorname{var}}_t^{s2} \), \( {\operatorname{var}}_t^{\alpha 1} \), \( {\operatorname{var}}_t^{\alpha 2} \) the corresponding variances.

$$ {u}_t^{s1}=\frac{1}{m_h}\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}T{s}_i\kern0.5em {u}_t^{\alpha 1}=\frac{1}{m_h}\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}T{\alpha}_i $$
(19)
$$ {u}_t^{s2}=\frac{1}{m_h}\sum \limits_{i=t-{m}_h}^tT{s}_i\kern0.5em {u}_t^{\alpha 2}=\frac{1}{m_h}\sum \limits_{i=t-{m}_h}^tT{\alpha}_i $$
(20)
$$ {\operatorname{var}}_t^{s1}=\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}{\left(T{s}_i-{u}_t^{s1}\right)}^2\kern0.5em {\operatorname{var}}_t^{\alpha 1}=\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}{\left(T{\alpha}_i-{u}_t^{\alpha 1}\right)}^2 $$
(21)
$$ {\operatorname{var}}_t^{s2}=\sum \limits_{i=t-{m}_h}^t{\left(T{s}_i-{u}_t^{s2}\right)}^2\kern0.5em {\operatorname{var}}_t^{\alpha 2}=\sum \limits_{i=t-{m}_h}^t{\left(T{\alpha}_i-{u}_t^{\alpha 2}\right)}^2 $$
(22)

If the means and variances satisfy formulas (23)-(26), a switch from gesture “right click up” to “right click down” is considered to have occurred; if all the conditions satisfy formulas (27)-(30), a switch from “right click down” to “right click up” is considered to have occurred.

$$ {u}_t^{s1}/{u}_t^{s2}\ge {T}_u^s $$
(23)
$$ {u}_t^{\alpha 1}/{u}_t^{\alpha 2}\ge {T}_u^{\alpha } $$
(24)
$$ {T}_{\mathrm{var}}^{s1}\le {\operatorname{var}}_t^{s1}/{\operatorname{var}}_t^{s2}\le {T}_{\mathrm{var}}^{s2} $$
(25)
$$ {T}_{\mathrm{var}}^{\alpha 1}\le {\operatorname{var}}_t^{\alpha 1}/{\operatorname{var}}_t^{\alpha 2}\le {T}_{\mathrm{var}}^{\alpha 2} $$
(26)
$$ {u}_t^{s2}/{u}_t^{s1}\ge {T}_u^s $$
(27)
$$ {u}_t^{\alpha 2}/{u}_t^{\alpha 1}\ge {T}_u^{\alpha } $$
(28)
$$ {T}_{\mathrm{var}}^{s1}\le {\operatorname{var}}_t^{s1}/{\operatorname{var}}_t^{s2}\le {T}_{\mathrm{var}}^{s2} $$
(29)
$$ {T}_{\mathrm{var}}^{\alpha 1}\le {\operatorname{var}}_t^{\alpha 1}/{\operatorname{var}}_t^{\alpha 2}\le {T}_{\mathrm{var}}^{\alpha 2} $$
(30)
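The decision rule of Eqs. (19)-(30) can be sketched as follows. This is a hedged illustration: the function names and all threshold values are our assumptions (the paper does not list its thresholds), and the variances are left unnormalized as in Eqs. (21)-(22).

```python
def window_stats(xs):
    """Mean and unnormalized variance of a window, Eqs. (19)-(22)."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs)
    return m, v

def detect_switch(areas, angles, m_h, t_u_s=1.5, t_u_a=1.2,
                  t_var=(0.1, 10.0)):
    """Return 'down', 'up', or None for the current N_h-frame window.

    areas, angles: the window's triangle areas Ts_i and vertex-1 angles Ta_i.
    """
    us1, vs1 = window_stats(areas[:m_h])    # oldest m_h frames
    us2, vs2 = window_stats(areas[-m_h:])   # newest m_h frames
    ua1, va1 = window_stats(angles[:m_h])
    ua2, va2 = window_stats(angles[-m_h:])
    var_ok = (t_var[0] <= vs1 / vs2 <= t_var[1] and
              t_var[0] <= va1 / va2 <= t_var[1]) if vs2 and va2 else False
    if us1 / us2 >= t_u_s and ua1 / ua2 >= t_u_a and var_ok:
        return "down"   # Eqs. (23)-(26): area and angle shrink
    if us2 / us1 >= t_u_s and ua2 / ua1 >= t_u_a and var_ok:
        return "up"     # Eqs. (27)-(30): area and angle recover
    return None
```

A window whose statistics stay flat yields no switch, which is what keeps the recognizer quiet between clicks.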

3 Experiments and discussions

To demonstrate the effectiveness and robustness of the proposed method, two different kinds of experimental conditions were tested. In the first set of experiments, a still camera (a SONY EX-FCB48 CCD) was used to test the detection and tracking accuracy and robustness. In the other set of experiments, a Pan-Tilt-Zoom (PTZ) camera was used for gesture recognition and hand servo tracking in real-time.

The proposed method was implemented in C++ using the OpenCV library and ran on a 1.8 GHz Pentium Dual-Core CPU with 2 GB of DDR memory. In the experiments, the number of points in a sub contour point set was set to 21, namely L = 10, and the sample number was N = 20.

3.1 Fingertip detection experiments

To demonstrate the effectiveness and robustness of the proposed fingertip detection method, 120 hand images with different numbers of visible fingertips, hand gestures, skin colors, and races were used in this experiment. The proposed method was also compared with three other commonly used methods: the traditional curvature method [27], the centroid circle method [16], and the convex hull and defect analysis method [9].

Qualitative comparison: As shown in Figs. 14, 15, 16 and 17, the traditional curvature method was sensitive to hand segmentation noise and prone to false detections at local minima; the centroid circle method was better than the traditional curvature method yet prone to missed detections when a finger was obviously bent; the convex hull and defect analysis method was prone to false detections at some convex hull vertices; the proposed method outperformed all three commonly used fingertip detection methods.

Fig. 14 Fingertip detection result of the traditional curvature method

Fig. 15 Fingertip detection result of the centroid circle method

Fig. 16 Fingertip detection result of the convex hull and defect analysis method

Fig. 17 Fingertip detection result of the proposed method

Quantitative comparison: The percentages listed in Table 1 represent the success rate of fingertip detection, defined as the number of images with correctly detected fingertips divided by the total number of fingertip images used in the experiments. The proposed method performed better than the other three methods. The table also presents the time cost of each method; the proposed method consumed more time than the other three, since Monte Carlo Sampling is relatively time-consuming. Moreover, the time cost increases with the sampling number, a disadvantage of the proposed method that nevertheless becomes negligible on a faster computer.

Table 1 Experimental Result of Fingertip detection using different methods

3.2 Detecting and tracking using still camera

The experimental results are shown in Fig. 18. Each frame was divided into four views: the upper left corner was the original input video, the upper right corner the fingertip detection result, the lower left corner the feature triangle extraction result, and the lower right corner the result of simulated handwriting. In the experimental results, the red triangle is the target feature triangle, the green dot is the fingertip point, and the green-and-white dots are the inter-finger points.

Fig. 18 The tracking results using a still camera

From the first frame to the 102nd frame, the hand extended from outside the camera’s field of view into the middle of the image, so this whole process was in a non-writing (right click up) state. From the 102nd frame to the 142nd frame, a bending of the right finger simulated the action of “right click down”, and the state was converted to the handwriting state in the 142nd frame, when the corresponding lower right corner view drew the current fingertip position. The system remained in the handwriting state from the 142nd to the 236th frame, when the action of “right click up” was detected. In the handwriting state, the fingertip positions were recorded and plotted in the lower right corner, while in the non-handwriting state, the fingertip positions were ignored. In the process of writing three numbers, a total of six state switches occurred, all of which were detected accurately and in a timely manner by the proposed algorithm.

The experimental results revealed that the numbers written down were not very continuous or smooth, since the lower right corner view merely recorded the fingertip position in each frame, yielding a series of discrete points; more detailed explanations are given in [32,33,34,35]. Meanwhile, the distances between points were not identical, since the moving speed of the fingertip was not uniform although the time interval between frames was a constant 40 ms. Thus the faster the writing speed, the larger the interval between two points; the slower the speed, the smaller the interval, and sometimes two points even overlapped when the speed was slow enough. If adjacent points were connected by straight lines, the written characters would be continuous, and if smoothing algorithms were further adopted, the handwriting would be smoother and more continuous.

The tracked fingertip positions were compared with the ground truth, as shown in Fig. 19, where the ground truth positions were recorded manually frame by frame. The Euclidean distance between the tracked fingertip and the ground truth is shown in the left part of Fig. 20. The results showed that the tracked fingertip positions were accurate, with an average position error of only 5.48 pixels. The average algorithm time was 48.45 ms (see the right part of Fig. 20), which met the requirement of real-time applications.

Fig. 19 Fingertip position tracking results. Black: ground truth; Red: detected position

Fig. 20 Fingertip position error and algorithm time

3.3 Detecting and tracking using PTZ camera

The hardware of the PTZ camera servo control system was composed of a PTZ camera, a video capture card, a PC, and an RS232-485 converter, as seen in Fig. 21. The PTZ camera used in this system was designed and developed in-house. It possessed three degrees of freedom: Pan (0~360°), Tilt (0~90°), and Zoom (1-18x optical).

Fig. 21
figure 21

The hardware of the PTZ camera servo control system

The servo control model is shown in Fig. 22. The objective of the servo control system was to keep the target in the middle of the camera's field of view by driving the PTZ camera in every frame. In this control model there were two system delays, τ1 and τ2. τ1 was the video system time; in this system τ1 was 40 ms (25 fps) owing to the use of a PAL CCD. τ2 was the algorithm time; if τ2 was less than τ1, the system ran in full real time. If τ2 was between 40 and 100 ms, i.e. still more than 10 frames per second, the model was able to achieve good results in practical applications (Table 2).

Fig. 22
figure 22

PTZ camera servo control system

Table 2 Parameters used in the experiments
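The servo objective described above can be sketched as a per-frame control step that converts the fingertip's pixel offset from the image centre into pan/tilt commands. The proportional gains and the command interface below are assumptions for illustration; the paper does not specify its control law.

```python
# Hedged sketch of the servo loop: each frame, the offset of the detected
# fingertip from the image centre is turned into pan/tilt increments that
# re-centre the target. Gains are hypothetical (degrees per pixel).

IMG_W, IMG_H = 360, 288          # PAL resolution used in the experiments
KP_PAN, KP_TILT = 0.05, 0.05     # assumed proportional gains

def servo_step(fingertip):
    """Return (pan_delta, tilt_delta) in degrees to re-centre the fingertip."""
    ex = fingertip[0] - IMG_W / 2  # horizontal pixel error
    ey = fingertip[1] - IMG_H / 2  # vertical pixel error
    return (KP_PAN * ex, KP_TILT * ey)

print(servo_step((220, 100)))
```

In practice the commands would be sent to the camera over the RS232-485 link once per frame, so the loop period is bounded below by τ1 + τ2.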

Figure 23 shows the tracking results using a PTZ camera. From the first frame to the 95th frame, the hand extended from outside the view into the middle of the camera, so this whole process was in a non-tracking state. In the 142nd frame, a gesture state switch from “right click up” to “right click down” was detected, and the servo tracking system began to drive the PTZ camera to keep the fingertip in the middle of the camera, as shown in the 214th frame. When a gesture state switch from “right click down” to “right click up” was detected, as shown in the 395th frame, the servo control system stopped tracking the fingertip and waited for the next “right click down” action. In the experiment, a total of three state transitions occurred; the algorithm judged each state switch correctly, and the hand and fingertip always remained in the middle of the image while tracking.
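The gesture-gated tracking logic above amounts to a small state machine: an "up→down" switch starts servo tracking, and a "down→up" switch stops it. The sketch below uses simplified string labels as stand-ins for the output of the feature-triangle gesture classifier; it is an illustration, not the paper's code.

```python
# Minimal sketch of the gesture-gated tracking state machine: a switch
# from "right click up" to "right click down" starts servo tracking,
# and the reverse switch stops it. Gesture labels are hypothetical
# stand-ins for the feature-triangle classifier's output.

class TrackingGate:
    def __init__(self):
        self.prev = "right_click_up"
        self.tracking = False

    def update(self, gesture):
        if self.prev == "right_click_up" and gesture == "right_click_down":
            self.tracking = True      # begin driving the PTZ camera
        elif self.prev == "right_click_down" and gesture == "right_click_up":
            self.tracking = False     # stop, wait for next click-down
        self.prev = gesture
        return self.tracking

gate = TrackingGate()
states = [gate.update(g) for g in
          ["right_click_up", "right_click_down",
           "right_click_down", "right_click_up"]]
print(states)
```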

Fig. 23
figure 23

The tracking results using a PTZ camera

Figure 24 shows the tracking result of the experiment. The blue line shows the horizontal and vertical coordinates of the tracked fingertips, while the green and red dashed lines represent the ideal position range. The camera resolution was 360×288, the ideal horizontal position range was 175~185, and the ideal vertical position range was 139~149. The result shows that during the whole tracking process, the hand and fingertip always remained in the middle of the camera's field of view.
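The centring criterion in Fig. 24 can be checked directly: a tracked point is "in range" when its coordinates fall inside the 175~185 horizontal and 139~149 vertical windows around the centre of the 360×288 image. The sample points below are hypothetical:

```python
# Sketch of the Fig. 24 evaluation: fraction of frames in which the
# tracked fingertip stayed inside the ideal centre window. The sample
# points are hypothetical illustration data.

H_RANGE = (175, 185)   # ideal horizontal range for a 360-pixel-wide image
V_RANGE = (139, 149)   # ideal vertical range for a 288-pixel-high image

def in_centre(point):
    x, y = point
    return H_RANGE[0] <= x <= H_RANGE[1] and V_RANGE[0] <= y <= V_RANGE[1]

tracked = [(180, 144), (178, 150), (184, 140)]
ratio = sum(in_centre(p) for p in tracked) / len(tracked)
print(ratio)
```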

Fig. 24
figure 24

The horizontal and vertical coordinates of the tracked fingertip

Figure 25 shows the algorithm time. In most frames the time was between 60 ms and 80 ms, while in some frames it was considerably longer than the average, since the time consumed by fingertip detection using the cubic Bezier curve depended on the hand region segmentation: the more points in the hand contour, the more time consumed.

Fig. 25
figure 25

The Euclidean distance between the tracked fingertip and the image center (left); the algorithm time (right)

4 Conclusion

This paper presented a real-time dynamic gesture recognition and hand tracking scheme using a PTZ camera. The aim was a robust scheme that stably recognized simple hand gestures and tracked the hand with a PTZ camera so as to keep the fingertip always in the middle of the camera's field of view. First, a new approach was proposed to detect the fingertip by estimating the curvature of hand contour points based on Monte Carlo sampling Bezier curve fitting, which is much more robust to noise than existing curvature-based methods. Second, a feature triangle analysis method was used to recognize simple dynamic gestures in real time. Third, the proposed fingertip detection and feature triangle analysis were applied to a self-made PTZ camera to realize servo tracking of the target fingertip whenever the gesture “right click down” was detected. The experimental results showed that the proposed approach successfully recognized the dynamic gestures, located the fingertip positions precisely, and realized follow-up servo tracking with the PTZ camera in real time.

Moreover, the curvature estimation approach using cubic Bezier curve fitting based on Monte Carlo sampling can easily be extended to other areas (e.g. industrial visual inspection) where curvature needs to be estimated accurately and quickly.