1 Introduction

In recent years, hand detection, gesture recognition, and tracking applied to perceptive user interfaces and virtual reality have gained growing attention [7, 23, 25]. Although hand posture recognition (static hand gestures) has been reported to be very successful in HCI (Human-Computer Interaction), dynamic gesture recognition and tracking with a single camera still face many obstacles for the following reasons [24]. First, modeling hand gestures is difficult owing to the deformable, articulated structure of hands and the self-occlusion problem. Second, cluttered environments and varying lighting conditions lead to inaccurate hand segmentation. Third, hand appearance varies among people of different races, ages, weights, and so on. To address these problems, several hand gesture recognition and tracking methods have been proposed; they can be categorized into the following classes.

First, hand gestures are modeled and recognized using geometrical characteristics of the segmented hand region [4, 5, 18, 37]. In this approach, once the segmented image is obtained, the convex hull and convexity defects are calculated to find the hand area and finger positions, after which the vertices of the convex hull are taken as the fingertips. This causes mistakes when vertices correspond to other corners of the hand silhouette rather than fingertips, leading to errors in gesture recognition and tracking. Second, depth images are used for gesture recognition and hand tracking by some researchers [12, 28, 31]. The advantage of 3D cameras lies in better discrimination between gestures and background, while the disadvantage lies in relatively low resolution. In addition, at equivalent resolution, 3D cameras are much more expensive than ordinary cameras. Third, deep learning has drawn much attention in gesture recognition and hand tracking [8, 19, 29]. Methods based on deep learning perform excellently on particular gestures after long training, whereas the time they consume in real-time applications is unsatisfactory. Fourth, other model-based algorithms have been used in the literature. In [24], the authors proposed centroid tracking of hand gestures that captured and retained time-sequence information for feature extraction. Zhang [36] used a multi-cue hand tracking method integrating motion and color cues from a feature-point-selection view.

In human-computer interaction, one problem has long existed yet received hardly any attention: an interaction system using a static camera stops working once the hand leaves the camera’s field of view. A PTZ camera is a desirable option to solve this problem. Nevertheless, none of the studies above combined a PTZ camera with a dynamic hand tracking algorithm to achieve servo tracking.

Distinguished from the existing approaches, a real-time dynamic gesture recognition and hand tracking method using a PTZ camera was proposed in this study. The aim was a robust scheme that stably recognizes simple hand gestures and tracks the hand with a PTZ camera so that the fingertip remains in the center of the camera’s field of view.

Specifically, the hand region in a cluttered environment was segmented using skin color segmentation in the YCbCr color space, from which the silhouette of the hand was obtained. For every point on the silhouette, a sub contour point set was constructed, and the Monte Carlo Sampling method was used to obtain the best-fitting cubic Bezier curve for the sub contour set. The curvature of the continuous cubic Bezier curve at the corresponding point was then taken as the estimated curvature of the point on the silhouette. Given the estimated curvatures, the local maxima of a cumulative curvature curve were detected as candidate fingertips. After that, geometry feature analysis, including a convexity approach and feature triangle analysis, was used to locate the final fingertip and recognize several specific simple gestures. Finally, if a tracking gesture (right click down) was recognized, the system drove the PTZ camera to track the fingertip and keep it in the middle of the camera’s field of view in real-time. An overview of the proposed method is shown in Fig. 1.

Fig. 1 Overview of the proposed method

The innovative points of the proposed method included:

  • A new fingertip detection method was proposed based on Monte Carlo sampling Cubic Bezier curve fitting.

  • A feature triangle analysis was used to realize simple dynamic gesture recognition.

  • The gesture recognition was combined with a PTZ camera to achieve servo tracking in real-time.

This paper is organized as follows. Section 2 details the proposed gesture recognition method including hand region detection and extraction, fingertip detection based on Monte Carlo Sampling cubic Bezier curve fitting, and simple gesture modeling using feature triangle analysis. Furthermore, Section 3 introduces the experimental results and analyses. Last but not least, Section 4 outlines the conclusions and future research.

2 The proposed gesture recognition and tracking method

2.1 Hand region detection and extraction

The first step of finger detection was detecting the hand region, which played an essential role in the whole process, since a complete and accurate hand region laid the foundation for further analysis. In the proposed method, this was achieved through skin color segmentation in the YCbCr color space, which separates luminance (Y) from chrominance (Cb, Cr), yields a compact skin cluster, and is widely used in video compression standards. An image in RGB color space was transformed to YCbCr using the following equation:

$$ \left[\begin{array}{c}Y\\ {} Cb\\ {} Cr\end{array}\right]=\left[\begin{array}{c}16\\ {}128\\ {}128\end{array}\right]+\left[\begin{array}{ccc}65.481& 128.553& 24.966\\ {}-37.797& -74.203& 112\\ {}112& -93.786& -18.214\end{array}\right]\left[\begin{array}{c}R\\ {}G\\ {}B\end{array}\right] $$
(1)

In this paper, the extraction of the hand region from the background was accomplished using a threshold technique proposed by Chai and Ngan [6]. However, the threshold ranges of Cb and Cr in our method, RCb = [77, 127] and RCr = [133, 155], were narrower than those in [6], in order to minimize the effect of background noise.

After skin detection and extraction, a morphological filtering process consisting of an erosion operation and a dilation operation was conducted to obtain more complete candidate regions. Subsequently, the largest region was chosen as the desired hand region (see Fig. 2).
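The segmentation step above can be sketched as follows. This is a minimal per-pixel illustration (not the paper's implementation), assuming RGB values normalized to [0, 1] and using the conversion of Eq. (1) together with the narrowed thresholds RCb = [77, 127], RCr = [133, 155]; the function names are illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 conversion of Eq. (1); r, g, b assumed in [0, 1]."""
    y  = 16.0  +  65.481 * r + 128.553 * g +  24.966 * b
    cb = 128.0 + (-37.797 * r -  74.203 * g + 112.0   * b)
    cr = 128.0 + (112.0   * r -  93.786 * g -  18.214 * b)
    return y, cb, cr

def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 155)):
    """Threshold test used for hand-region extraction."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return (cb_range[0] <= cb <= cb_range[1] and
            cr_range[0] <= cr <= cr_range[1])

def segment(image):
    """Binary skin mask for an image given as rows of (r, g, b) triples."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in image]
```

In practice the resulting binary mask would then be cleaned by the erosion/dilation step and the largest connected component kept as the hand region.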

Fig. 2 Hand region segmentation. Left: input image; Middle: skin color segmentation result; Right: hand contour

2.2 Fingertip detection based on Monte Carlo sampling Bezier curve fitting

As the first step of gesture recognition, fingertip detection plays a crucial role; previously proposed methods can be categorized as follows.

First, fingertip detection was handled by correlation techniques using a circular mask, in which the fingertip was modeled as a cylinder with a hemispherical cap [3, 13, 14, 21, 22, 30]. Normalized correlation was then used with a template of a properly sized circle corresponding to the user’s fingertip size. Under this model, the hand segmentation must be very accurate to achieve good detection and to fulfill the requirement of the approximately cylindrical shape. Second, local maxima of the curvature along the boundary of the silhouette were used as features to detect fingertips [2, 15, 20, 26]. In all these curvature-based methods, the curvature was calculated from the discrete points on the silhouette, which led to high sensitivity to noise such as local valleys caused by incomplete hand region segmentation. Third, the convexity approach was used for finger detection and hand gesture recognition in some recent studies, in which the convex hull and convexity defects were calculated to find the hand area and finger positions as soon as the segmented image was obtained [4, 10]. The vertices of the convex hull were then considered to be the fingertips, which caused mistakes when the vertices represented other corners of the hand silhouette rather than fingertips. Fourth, an algorithm based on the distance of the contour points to the hand position was also used in some studies [11, 17]. In this algorithm, points far from the center of the hand were assumed to be fingertips. However, a hand is deformable, and its center may be far from the expected position if the gesture is not regular, which leads to errors in fingertip location.

In contrast to the existing approaches, an innovative method using cubic Bezier curve fitting based on Monte Carlo Sampling was proposed in this study for fingertip detection, by which the robustness to noise and the accuracy of fingertip location were significantly improved.

2.2.1 Bezier curve presenting

After obtaining the contour of the hand, it was essential to calculate the curvature of every point on the contour and locate the local maxima.

The contour point set is {CPm}  1 ≤ m ≤ M, where M is the total number of points on the contour, and a sub contour point set of CPm including 2L + 1 points is defined as follows:

$$ \left\{C{P}_{m+j}^s\right\}\kern0.48em 1\le m\le M\kern0.36em -L\le j\le L $$
(2)

The points are centered at contour point CPm and consist of the L contour points before CPm and the L contour points after CPm (see Fig. 3). A curve was fitted to the sub contour point set, and the curvature of the fitted curve at its middle point was taken as the estimated curvature of the contour at CPm. The problem was thus transformed into finding a robust and efficient curve fitting method for the sub contour point set \( C{P}_{m+j}^s \). To this end, a Bezier curve Monte Carlo Sampling and fitting method with high fitting accuracy and computational efficiency was proposed.

Fig. 3 Sub contour centered at CPm

Bezier curves are widely used in computer graphics for smooth curve modeling. The curve is completely contained in the convex hull of its control points, so the points can be displayed graphically and used to manipulate the curve intuitively. Affine transformations such as translation and rotation can be applied to the curve by applying the corresponding transformation to its control points.

An nth-degree Bezier curve is represented by the following equation:

$$ P(t)=\sum \limits_{i=0}^n{P}_i{B}_{i,n}(t)\kern0.36em t\in \left[0,1\right] $$
(3)

where Pi refers to the control points, n refers to the degree, and t refers to the parameter. The basis function, Bi, n(t), is the classical nth-degree Bernstein polynomials defined explicitly by:

$$ {B}_{i,n}(t)={C}_n^i{t}^i{\left(1-t\right)}^{n-i}=\frac{n!}{i!\left(n-i\right)!}{t}^i{\left(1-t\right)}^{n-i} $$
(4)

Considering accuracy and computational efficiency, degree three (n = 3), i.e., the cubic Bezier (see Fig. 4), was adopted in the current study. Namely, four points are required to define a cubic Bezier: two off the curve, called the control points, and two on the ends of the curve, called the end points. Given the sub contour point set \( C{P}_{m+j}^s \), the curve fitting problem was to find four optimal control points defining the fitting curve. An easy way was to use an exhaustive search through the possible space, yet this brute-force method was time-consuming. In this paper, the Monte Carlo method was proposed to find the optimal control points quickly.
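Eqs. (3)-(4) for the cubic case can be sketched directly. The following is a small illustration (function names are our own) of evaluating a Bezier curve from its control points via the Bernstein basis:

```python
from math import comb

def bernstein(i, n, t):
    """B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i), as in Eq. (4)."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)

def bezier_point(ctrl, t):
    """P(t) = sum_i P_i B_{i,n}(t), Eq. (3); ctrl is a list of (x, y) points."""
    n = len(ctrl) - 1
    x = sum(px * bernstein(i, n, t) for i, (px, _) in enumerate(ctrl))
    y = sum(py * bernstein(i, n, t) for i, (_, py) in enumerate(ctrl))
    return x, y
```

Note that the end points are interpolated: P(0) equals the first control point and P(1) the last, which is why P0 and P3 lie on the curve while P1 and P2 only shape it.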

Fig. 4 Cubic Bezier Curve

2.2.2 Monte Carlo sampling

The Monte Carlo method provides approximate solutions to various mathematical problems by performing statistical sampling experiments on a computer. It applies to problems with absolutely no probabilistic content as well as to those with inherent probabilistic structure. The Cubic Bezier curve fitting problem in this work is formulated [21] as:

$$ \underset{x\in X}{\min}\left\{f(x):= E\left[F\left(x,\xi \right)\right]\right\} $$
(5)

where F(x, ξ) is a function of two vector variables x ∈ ℝⁿ and ξ ∈ ℝᵈ, and X ⊂ ℝⁿ is a given set; in this work, it is the sub contour point set \( C{P}_{m+j}^s \). ξ = ξ(ω) is a random vector representing a candidate sub contour \( P{\left(\omega, t\right)}_m^s \) defined by a randomly generated control point vector ω. The expectation in Eq. (5) is taken with respect to the probability distribution of ξ, which is assumed to be known. The function F(x, ξ) is formulated as:

$$ F\left(x,\xi \right)=\sum \limits_{j=-L}^{L}{\left|P{\left(\omega, {t}_j\right)}_m^s-C{P}_{m+j}^s\right|}^2 $$
(6)

where the t_j are parameter values uniformly spaced in [0, 1] corresponding to the points of the sub contour.

The objective is to find the Bezier curve that best fits the given sub contour point set in the least squares sense. Since ξ has a finite number of possible scenarios with positive probabilities pk, k = 1, ..., K, the expected value is written as:

$$ f(x)=\sum \limits_{k=1}^K{p}_kF\left(x,{\xi}_k\right) $$
(7)

However, an exhaustive discretization of the probability distribution leads to an exponential growth in the number of scenarios. Therefore, the Monte Carlo Sampling technique was used to reduce the computational complexity. The proposed approach generates a sequence of random sub Bezier curve contours ξ1, ξ2, ... in the sample space, which are independent of each other and uniformly distributed (i.e., independent and identically distributed). With the N generated random samples ξ1, ξ2, ..., ξN, the sample average function is written as:

$$ {\hat{f}}_N(x):= \frac{1}{N}\sum \limits_{i=1}^NF\left(x,{\xi}^i\right) $$
(8)

Indeed, \( {\hat{f}}_N(x) \) is an unbiased estimator of f(x). Moreover, according to the law of large numbers, \( {\hat{f}}_N(x) \) is a consistent estimator of f(x) as N → ∞. Subsequently, the cubic Bezier curve fitting problem in Eq. (5) is approximated as:

$$ \underset{x\in X}{\min}\left\{{\hat{f}}_N(x):= \frac{1}{N}\sum \limits_{i=1}^NF\left(x,{\xi}^i\right)\right\} $$
(9)

Specifically, owing to the properties of the cubic Bezier curve, the start control point P0 and the end control point P3 lie on the curve, while P1 and P2 lie off it. Given the sub contour point set \( C{P}_{m+j}^s \), the four control points of each candidate sub contour are determined by a randomly generated control point vector ωζ, ζ = 1, 2, ..., N, where N is the total number of random samples. Consequently, a candidate sub contour using a cubic Bezier curve is expressed as:

$$ {P}_{\zeta }{\left({\omega}^{\zeta },t\right)}_m^s=\sum \limits_{i=0}^3{P}_i{\left({\omega}^{\zeta}\right)}_m{B}_{i,3}(t)=\left[{t}^3\kern0.5em {t}^2\kern0.5em t\kern0.5em 1\right]\;\left[\begin{array}{cccc}-1& 3& -3& 1\\ {}3& -6& 3& 0\\ {}-3& 3& 0& 0\\ {}1& 0& 0& 0\end{array}\right]\;\left[\begin{array}{l}{\omega}_{0m}^{\zeta}\\ {}{\omega}_{1m}^{\zeta}\\ {}{\omega}_{2m}^{\zeta}\\ {}{\omega}_{3m}^{\zeta}\end{array}\right] $$
(10)

The control points \( {\omega}_{0m}^{\zeta } \), \( {\omega}_{1m}^{\zeta } \), \( {\omega}_{2m}^{\zeta } \) and \( {\omega}_{3m}^{\zeta } \) are uniform random variables drawn independently from the following ranges:

$$ {\displaystyle \begin{array}{l}{\omega}_{0m}^{\zeta}\in \left[C{P}_{m-L}^s-{T}_0,C{P}_{m-L}^s+{T}_0\right]\\ {}{\omega}_{1m}^{\zeta}\in \left[{T}_{\mathrm{min}}-{T}_0,{T}_{\mathrm{max}}+{T}_0\right]\\ {}{\omega}_{2m}^{\zeta}\in \left[{T}_{\mathrm{min}}-{T}_0,{T}_{\mathrm{max}}+{T}_0\right]\\ {}{\omega}_{3m}^{\zeta}\in \left[C{P}_{m+L}^s-{T}_0,C{P}_{m+L}^s+{T}_0\right]\end{array}} $$
(11)

where T0 denotes a constant threshold which defines a tolerance interval for the corresponding range, and Tmin, Tmax are defined as:

$$ {\displaystyle \begin{array}{l}{T}_{\mathrm{min}}=\min \left\{C{P}_{m+j}^s\right\}\kern0.36em -L\le j\le L\\ {}{T}_{\mathrm{max}}=\max \left\{C{P}_{m+j}^s\right\}\kern0.36em -L\le j\le L\end{array}} $$
(12)

Equation (11) means that the start and end control points of a candidate cubic Bezier curve are supposed to lie within areas centered at the corresponding sub contour points \( C{P}_{m-L}^s \) and \( C{P}_{m+L}^s \), while the other two control points have an equal chance of appearing at any position of the whole sample space defined by \( C{P}_{m+j}^s \) (see Fig. 5). The start and end control points are not placed exactly at \( C{P}_{m-L}^s \) and \( C{P}_{m+L}^s \), since the hand contour extraction is probably not perfectly accurate and the positions of the corresponding contour points may contain some uncertainty, which is modeled by sampling the points within areas centered at \( C{P}_{m-L}^s \) and \( C{P}_{m+L}^s \).
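The sampling-and-selection loop of Eqs. (6)-(11) can be sketched as follows. This is a hedged illustration, not the paper's implementation: P0 and P3 are drawn near the sub-contour end points, P1 and P2 anywhere in the padded bounding box, and the candidate with the smallest least-squares residual is kept. The function names, the default T0 value, and the uniform parameterization of the t values are our assumptions.

```python
import random

def cubic_bezier(p0, p1, p2, p3, t):
    # de Casteljau-equivalent closed form of a cubic Bezier point
    s = 1.0 - t
    x = s**3*p0[0] + 3*s*s*t*p1[0] + 3*s*t*t*p2[0] + t**3*p3[0]
    y = s**3*p0[1] + 3*s*s*t*p1[1] + 3*s*t*t*p2[1] + t**3*p3[1]
    return x, y

def fit_sub_contour(pts, n_samples=20, t0=2.0, rng=None):
    """Return (best control points, residual) for a sub contour point set."""
    rng = rng or random.Random(0)
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    lo = (min(xs) - t0, min(ys) - t0)          # padded bounding box, Eq. (12)
    hi = (max(xs) + t0, max(ys) + t0)
    ts = [j / (len(pts) - 1) for j in range(len(pts))]  # parameter per point

    def near(p):       # P0, P3: within +/- T0 of an end point (Eq. 11)
        return (rng.uniform(p[0]-t0, p[0]+t0), rng.uniform(p[1]-t0, p[1]+t0))

    def anywhere():    # P1, P2: uniform over the padded bounding box
        return (rng.uniform(lo[0], hi[0]), rng.uniform(lo[1], hi[1]))

    best, best_err = None, float("inf")
    for _ in range(n_samples):
        cand = (near(pts[0]), anywhere(), anywhere(), near(pts[-1]))
        err = sum((cx - px)**2 + (cy - py)**2          # residual, Eq. (6)
                  for t, (px, py) in zip(ts, pts)
                  for cx, cy in [cubic_bezier(*cand, t)])
        if err < best_err:
            best, best_err = cand, err
    return best, best_err
```

As expected from Eq. (9), increasing the sample count N can only tighten the best residual found.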

Fig. 5 Distribution of control points. Yellow: possible positions of control point P0; Brown: possible positions of control point P3; Green: possible positions of control points P1 and P2

2.2.3 Cubic Bezier curvature and local maximum extraction

Using the above-described Monte Carlo Sampling method (i.e., sampling randomly in the sample space N times), the set of control points that best fits the given sub contour \( C{P}_{m+j}^s \) is obtained as \( \left\{{P}_i^{best},i=0,1,2,3\right\} \). The curvature at any point of the Bezier curve is then given by the following equation:

$$ K(t)=\frac{\mid P\hbox{'}(t)\times P\hbox{'}\hbox{'}(t)\mid }{{\left|P\hbox{'}(t)\right|}^3} $$
(13)

where the first derivative P ' (t) and the second derivative P '  ' (t) of the Bezier curve are represented as follows:

$$ P\hbox{'}(t)=\sum \limits_{i=0}^n{P}_i^{best}{B}_{i,n}\hbox{'}(t)=n\sum \limits_{i=1}^n\left({P}_i^{best}-{P}_{i-1}^{best}\right){B}_{i-1,n-1}(t) $$
(14)
$$ P\hbox{'}\hbox{'}(t)=n\left(n-1\right)\sum \limits_{i=0}^{n-2}\left({P}_{i+2}^{best}-2{P}_{i+1}^{best}+{P}_i^{best}\right){B}_{i,n-2}(t) $$
(15)

Specifically, the first and the second derivative of Cubic Bezier curve are written as:

$$ P\hbox{'}(t)=3\left[\left({P}_1^{best}-{P}_0^{best}\right){\left(1-t\right)}^2+2\left({P}_2^{best}-{P}_1^{best}\right)t\left(1-t\right)+\left({P}_3^{best}-{P}_2^{best}\right){t}^2\right] $$
(16)
$$ P\hbox{'}\hbox{'}(t)=6\left[\left(-{P}_0^{best}+3{P}_1^{best}-3{P}_2^{best}+{P}_3^{best}\right)t+\left({P}_0^{best}-2{P}_1^{best}+{P}_2^{best}\right)\right] $$
(17)

As mentioned above, the given sub contour set is centered at contour point CPm and consists of the L contour points before CPm and the L contour points after CPm. The curvature at CPm can therefore be taken as the curvature at the middle point of the best-fitting Bezier curve defined by \( \left\{{P}_i^{best},i=0,1,2,3\right\} \); specifically, K(t = 0.5) is used as the required curvature.
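The midpoint-curvature computation of Eqs. (13), (16), and (17) can be written out directly. A minimal sketch (the function name is our own; the middle derivative term carries the factor 2 of the standard cubic Bezier derivative):

```python
def bezier_curvature_mid(p0, p1, p2, p3, t=0.5):
    """Curvature of a cubic Bezier at parameter t (default: midpoint)."""
    s = 1.0 - t
    # first derivative, Eq. (16)
    dx = 3 * ((p1[0]-p0[0])*s*s + 2*(p2[0]-p1[0])*t*s + (p3[0]-p2[0])*t*t)
    dy = 3 * ((p1[1]-p0[1])*s*s + 2*(p2[1]-p1[1])*t*s + (p3[1]-p2[1])*t*t)
    # second derivative, Eq. (17)
    ddx = 6 * ((-p0[0] + 3*p1[0] - 3*p2[0] + p3[0])*t + (p0[0] - 2*p1[0] + p2[0]))
    ddy = 6 * ((-p0[1] + 3*p1[1] - 3*p2[1] + p3[1])*t + (p0[1] - 2*p1[1] + p2[1]))
    # Eq. (13): |P' x P''| / |P'|^3 (2D cross product magnitude)
    cross = abs(dx * ddy - dy * ddx)
    speed = (dx * dx + dy * dy) ** 0.5
    return cross / speed**3
```

A quick sanity check: collinear control points give zero curvature, and the standard cubic approximation of a unit quarter circle (inner control offset k ≈ 0.5519) gives a midpoint curvature close to 1, the reciprocal of the radius.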

Once the curvatures of all contour points have been calculated, the local maxima, which have a relatively high probability of being fingertips, are extracted. However, the curvatures obtained by the Monte Carlo Sampling method still contain noise that makes the extraction of local maxima difficult (see the left image of Fig. 6). To address this problem, a smoothing process is introduced to construct a new curvature Kc (also called the cumulative curvature), described as follows:

$$ {K}_i^c=\sum \limits_{j=i}^{i+2L}{K}_j $$
(18)
Fig. 6 The curvatures at each contour point of the input image shown in Fig. 2. Left: curvature curve; Right: cumulative curvature curve

The right image of Fig. 6 shows the cumulative curvature of the left one. The cumulative curvature curve is clearly much smoother than the original curvature curve, which makes locating the local maxima easier and more accurate. The result of locating the maxima is shown in Fig. 7.
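Eq. (18) and the peak search can be sketched as a sliding-window sum followed by a strict local-maximum test. This is an illustration under our own assumptions (the contour is treated as circular, and the strict-inequality peak test is a simplification of whatever detector the authors used):

```python
def cumulative_curvature(k, L):
    """Kc_i = sum of k[i .. i+2L] on a circular contour, Eq. (18)."""
    n = len(k)
    return [sum(k[(i + j) % n] for j in range(2 * L + 1)) for i in range(n)]

def local_maxima(kc):
    """Indices that are strictly greater than both circular neighbors."""
    n = len(kc)
    return [i for i in range(n)
            if kc[i] > kc[(i - 1) % n] and kc[i] > kc[(i + 1) % n]]
```

On a noisy curvature signal the windowed sum suppresses isolated spikes, so the surviving maxima correspond to genuinely high-curvature stretches of the contour.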

Fig. 7 The detection result of local maxima in the cumulative curvature

2.3 Geometry feature analysis

Curvature detection alone was not enough to locate the desired fingertips. As shown in Fig. 7, the valley points between two fingers were also local maxima. Consequently, a geometry feature analysis based on convex hull analysis and convexity defect detection was proposed to overcome this difficulty.

Given a set containing all contour points, the convex hull of the set was detected by the Graham scan method [1] (see Fig. 8).

Fig. 8 Detection result of convex hull
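The paper uses the Graham scan [1]; for illustration, a compact alternative that produces the same hull is Andrew's monotone chain, sketched here (this substitution is ours, not the authors'):

```python
def convex_hull(points):
    """Convex hull of 2D points, returned in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])

    lower, upper = [], []
    for p in pts:                       # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # each list's last point is the other list's first; drop duplicates
    return lower[:-1] + upper[:-1]
```

Both algorithms run in O(n log n), dominated by the sort; interior points such as finger-valley candidates never appear among the hull vertices.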

After the detection of convex hull, the valley points between fingers were located by defect detection between two convex hull vertices (see Fig. 9):

Fig. 9 Detection result of convexity defects

The results of curvature detection in Fig. 7 and convexity defect detection in Fig. 9 were combined: points that possessed high curvature and did not belong to the valleys between fingers were finally considered to be fingertips (see Fig. 10). In the figure, locations marked with green circles alone are fingertips, while those with both green and white circles are points that possess high curvature but belong to valleys between fingers.
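The combination step can be sketched as simple index filtering: a curvature candidate survives only if it is not near any convexity-defect (valley) point. The `min_gap` tolerance (in contour-index units) is an illustrative parameter of ours, not a value from the paper:

```python
def select_fingertips(candidates, valleys, n_contour, min_gap=5):
    """Keep curvature maxima that are far from all valley points.

    candidates, valleys: contour indices; n_contour: contour length.
    """
    def circ_dist(a, b):
        # shortest distance between two indices on a closed contour
        d = abs(a - b) % n_contour
        return min(d, n_contour - d)

    return [c for c in candidates
            if all(circ_dist(c, v) > min_gap for v in valleys)]
```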

Fig. 10 Fingertip detection result

2.4 Feature triangle analysis

Every two fingertip vertices and the valley vertex between them form a triangle (see the red triangle in Fig. 11), called the feature triangle. Using the feature triangle for gesture recognition and hand tracking has several advantages, including scale invariance, rotation invariance, convenience for dynamic analysis, and ease of gesture modeling.

Fig. 11 Feature triangle

The gestures “right click down” (left of Fig. 12) and “right click up” (right of Fig. 12) were modeled easily using feature triangle analysis. After analyzing a video (556 frames) including several occurrences of the gestures “right click down” and “right click up”, the three interior angles and the area of the feature triangle were recorded, as seen in Fig. 13.

Fig. 12 Gesture “right click down” (left) and gesture “right click up” (right)

Fig. 13 The three interior angles (left) and the area of the feature triangle (right)

Figure 13 shows that the interior angles and the area of the feature triangle changed significantly when the gesture switched between “right click up” and “right click down”. By performing appropriate statistical analysis on the area and interior angles of the target feature triangle over n consecutive frames, the switching between “right click up” and “right click down” was judged and recognized accurately in real-time. Suppose {TSi, i = t − Nh, t − Nh + 1, ⋯, t} represents the triangle areas of the frames from t − Nh to t, and {Tαi}, {Tβi}, {Tγi} represent the three interior angles. Fig. 13 shows that the angle at vertex 1 (Tαi) exhibited the most obvious change, so it was the only angle considered in our model. Let \( {u}_t^{s1} \), \( {u}_t^{s2} \), \( {u}_t^{\alpha 1} \), \( {u}_t^{\alpha 2} \) represent the mean area and mean angle of the first mh frames and the last mh frames of the window, and \( {\operatorname{var}}_t^{s1} \), \( {\operatorname{var}}_t^{s2} \), \( {\operatorname{var}}_t^{\alpha 1} \), \( {\operatorname{var}}_t^{\alpha 2} \) the corresponding variances.

$$ {u}_t^{s1}=\frac{1}{m_h}\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}T{s}_i\kern0.5em {u}_t^{\alpha 1}=\frac{1}{m_h}\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}T{\alpha}_i $$
(19)
$$ {u}_t^{s2}=\frac{1}{m_h}\sum \limits_{i=t-{m}_h}^tT{s}_i\kern0.5em {u}_t^{\alpha 2}=\frac{1}{m_h}\sum \limits_{i=t-{m}_h}^tT{\alpha}_i $$
(20)
$$ {\operatorname{var}}_t^{s1}=\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}{\left(T{s}_i-{u}_t^{s1}\right)}^2\kern0.5em {\operatorname{var}}_t^{\alpha 1}=\sum \limits_{i=t-{N}_h}^{t-{N}_h+{m}_h}{\left(T{\alpha}_i-{u}_t^{\alpha 1}\right)}^2 $$
(21)
$$ {\operatorname{var}}_t^{s2}=\sum \limits_{i=t-{m}_h}^t{\left(T{s}_i-{u}_t^{s2}\right)}^2\kern0.5em {\operatorname{var}}_t^{\alpha 2}=\sum \limits_{i=t-{m}_h}^t{\left(T{\alpha}_i-{u}_t^{\alpha 2}\right)}^2 $$
(22)

If the means and variances satisfy formulas (23)-(26), a switch from gesture “right click up” to “right click down” is considered to have occurred; if all the conditions satisfy formulas (27)-(30), a switch from “right click down” to “right click up” is considered to have occurred.

$$ {u}_t^{s1}/{u}_t^{s2}\ge {T}_u^s $$
(23)
$$ {u}_t^{\alpha 1}/{u}_t^{\alpha 2}\ge {T}_u^{\alpha } $$
(24)
$$ {T}_{\mathrm{var}}^{s1}\le {\operatorname{var}}_t^{s1}/{\operatorname{var}}_t^{s2}\le {T}_{\mathrm{var}}^{s2} $$
(25)
$$ {T}_{\mathrm{var}}^{\alpha 1}\le {\operatorname{var}}_t^{\alpha 1}/{\operatorname{var}}_t^{\alpha 2}\le {T}_{\mathrm{var}}^{\alpha 2} $$
(26)
$$ {u}_t^{s2}/{u}_t^{s1}\ge {T}_u^s $$
(27)
$$ {u}_t^{\alpha 2}/{u}_t^{\alpha 1}\ge {T}_u^{\alpha } $$
(28)
$$ {T}_{\mathrm{var}}^{s1}\le {\operatorname{var}}_t^{s1}/{\operatorname{var}}_t^{s2}\le {T}_{\mathrm{var}}^{s2} $$
(29)
$$ {T}_{\mathrm{var}}^{\alpha 1}\le {\operatorname{var}}_t^{\alpha 1}/{\operatorname{var}}_t^{\alpha 2}\le {T}_{\mathrm{var}}^{\alpha 2} $$
(30)
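The decision rule of Eqs. (19)-(30) can be sketched as follows. This is a hedged illustration: the function names and all threshold values are our assumptions (the paper does not list its thresholds), and the variances are left unnormalized as in Eqs. (21)-(22).

```python
def window_stats(xs):
    """Mean and unnormalized variance of a window, Eqs. (19)-(22)."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs)
    return m, v

def detect_switch(areas, angles, m_h, t_u_s=1.5, t_u_a=1.2,
                  t_var=(0.1, 10.0)):
    """Return 'down', 'up', or None for the current N_h-frame window.

    areas, angles: the window's triangle areas Ts_i and vertex-1 angles Ta_i.
    """
    us1, vs1 = window_stats(areas[:m_h])    # oldest m_h frames
    us2, vs2 = window_stats(areas[-m_h:])   # newest m_h frames
    ua1, va1 = window_stats(angles[:m_h])
    ua2, va2 = window_stats(angles[-m_h:])
    var_ok = (t_var[0] <= vs1 / vs2 <= t_var[1] and
              t_var[0] <= va1 / va2 <= t_var[1]) if vs2 and va2 else False
    if us1 / us2 >= t_u_s and ua1 / ua2 >= t_u_a and var_ok:
        return "down"   # Eqs. (23)-(26): area and angle shrink
    if us2 / us1 >= t_u_s and ua2 / ua1 >= t_u_a and var_ok:
        return "up"     # Eqs. (27)-(30): area and angle recover
    return None
```

A window whose statistics stay flat yields no switch, which is what keeps the recognizer quiet between clicks.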

3 Experiments and discussions

To demonstrate the effectiveness and robustness of the proposed method, two different kinds of experimental conditions were tested. In the first set of experiments, a still camera (a SONY EX-FCB48 CCD) was used to test the detection and tracking accuracy and robustness. In the other set of experiments, a Pan-Tilt-Zoom (PTZ) camera was used for gesture recognition and hand servo tracking in real-time.

The proposed method was implemented in C++ using the OpenCV library and ran on a 1.8 GHz Pentium Dual-Core CPU with 2 GB of DDR memory. In the experiments, the number of points in a sub contour point set was set to 21, namely L = 10, and the sample number was N = 20.

3.1 Fingertip detection experiments

To demonstrate the effectiveness and robustness of the proposed fingertip detection method, 120 hand images with different numbers of visible fingertips, hand gestures, skin colors, and races were used in this experiment. The proposed method was also compared with three other commonly used methods: the traditional curvature method [27], the centroid circle method [16], and the convex hull and defect analysis method [9].

Qualitative comparison: As shown in Figs. 14, 15, 16 and 17, the traditional curvature method was sensitive to hand segmentation noise and prone to false detections at local minima; the centroid circle method was better than the traditional curvature method yet prone to missed detections when a finger was obviously bent; the convex hull and defect analysis method was prone to false detections at some convex hull vertices; the proposed method outperformed all three commonly used fingertip detection methods.

Fig. 14 Fingertip detection result of the traditional curvature method

Fig. 15 Fingertip detection result of the centroid circle method

Fig. 16 Fingertip detection result of the convex hull and defect analysis method

Fig. 17 Fingertip detection result of the proposed method

Quantitative comparison: The percentages listed in Table 1 represent the success rate of fingertip detection, defined as the number of images with correctly detected fingertips divided by the total number of fingertip images used in the experiments. The proposed method performed better than the other three methods. The table also presents the time cost of each method; the proposed method consumed more time than the other three, since Monte Carlo Sampling is relatively time-consuming. Moreover, the time cost increases with the sampling number, a disadvantage of the proposed method that nevertheless becomes negligible on a faster computer.

Table 1 Experimental Result of Fingertip detection using different methods

3.2 Detecting and tracking using still camera

The experimental results are shown in Fig. 18. Each frame was divided into four views: the upper left corner was the original input video, the upper right corner the fingertip detection result, the lower left corner the feature triangle extraction result, and the lower right corner the result of simulated handwriting. In the experimental results, the red triangle is the target feature triangle, the green dot is the fingertip point, and the green-and-white dots are the inter-finger points.

Fig. 18 The tracking results using a still camera

From the first frame to the 102nd frame, the hand extended from outside the camera’s field of view into the middle of the image, so this whole process was in a non-writing (right click up) state. From the 102nd frame to the 142nd frame, a bending of the right finger simulated the action of “right click down”, and the state was converted to the handwriting state in the 142nd frame, when the corresponding lower right corner view drew the current fingertip position. The system remained in the handwriting state from the 142nd to the 236th frame, when the action of “right click up” was detected. In the handwriting state, the fingertip positions were recorded and plotted in the lower right corner, while in the non-handwriting state, the fingertip positions were ignored. In the process of writing three numbers, a total of six state switches occurred, all of which were detected accurately and in a timely manner by the proposed algorithm.

The experimental results revealed that the numbers written down were not very continuous or smooth, since the lower right corner view merely recorded the fingertip position in each frame, yielding a series of discrete points; more detailed explanations are given in [32,33,34,35]. Meanwhile, the distances between points were not identical, since the moving speed of the fingertip was not uniform although the time interval between frames was a constant 40 ms. Thus the faster the writing speed, the larger the interval between two points; the slower the speed, the smaller the interval, and sometimes two points even overlapped when the speed was slow enough. If adjacent points were connected by straight lines, the written characters would be continuous, and if smoothing algorithms were further adopted, the handwriting would be smoother and more continuous.

The tracked fingertip positions were compared with the ground truth, as shown in Fig. 19, where the ground truth positions were recorded manually frame by frame. The Euclidean distance between the tracked fingertip and the ground truth is shown in the left part of Fig. 20. The results showed that the tracked fingertip positions were accurate, with an average position error of only 5.48 pixels. The average algorithm time was 48.45 ms (see the right part of Fig. 20), which met the requirement of real-time applications.

Fig. 19 Fingertip position tracking results. Black: ground truth; Red: detected position

Fig. 20 Fingertip position error and algorithm time

3.3 Detecting and tracking using PTZ camera

The hardware of the PTZ camera servo control system was composed of a PTZ camera, a video capture card, a PC, and an RS232-485 converter, as seen in Fig. 21. The PTZ camera used in this system was designed and developed in-house. It possessed three degrees of freedom: Pan (0~360°), Tilt (0~90°), and Zoom (1-18x optical).

Fig. 21
figure 21

The hardware of the PTZ camera servo control system

The servo control model is shown in Fig. 22. The objective of the servo control system was to keep the target in the middle of the camera's field of view by driving the PTZ camera in every frame. In this control model there were two system delays, τ1 and τ2. τ1 was the video system time; in this system τ1 was 40 ms (25 fps) owing to the use of a PAL CCD. τ2 was the algorithm time; if τ2 was less than τ1, the system ran in full real time. If τ2 was between 40 and 100 ms, i.e. still more than 10 frames per second, the model was able to achieve good results in practical applications (Table 2).

Fig. 22
figure 22

PTZ camera servo control system

Table 2 Parameters used in the experiments
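The servo objective described above can be sketched as a per-frame control step that converts the fingertip's pixel offset from the image centre into pan/tilt commands. The proportional gains and the command interface below are assumptions for illustration; the paper does not specify its control law.

```python
# Hedged sketch of the servo loop: each frame, the offset of the detected
# fingertip from the image centre is turned into pan/tilt increments that
# re-centre the target. Gains are hypothetical (degrees per pixel).

IMG_W, IMG_H = 360, 288          # PAL resolution used in the experiments
KP_PAN, KP_TILT = 0.05, 0.05     # assumed proportional gains

def servo_step(fingertip):
    """Return (pan_delta, tilt_delta) in degrees to re-centre the fingertip."""
    ex = fingertip[0] - IMG_W / 2  # horizontal pixel error
    ey = fingertip[1] - IMG_H / 2  # vertical pixel error
    return (KP_PAN * ex, KP_TILT * ey)

print(servo_step((220, 100)))
```

In practice the commands would be sent to the camera over the RS232-485 link once per frame, so the loop period is bounded below by τ1 + τ2.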

Figure 23 shows the tracking results using a PTZ camera. From the first frame to the 95th frame, the hand extended from outside the view into the middle of the camera, so this whole process was in a non-tracking state. In the 142nd frame, a gesture state switch from “right click up” to “right click down” was detected, and the servo tracking system began to drive the PTZ camera to keep the fingertip in the middle of the camera, as shown in the 214th frame. When a gesture state switch from “right click down” to “right click up” was detected, as shown in the 395th frame, the servo control system stopped tracking the fingertip and waited for the next “right click down” action. In the experiment, a total of three state transitions occurred; the algorithm judged each state switch correctly, and the hand and fingertip always remained in the middle of the image while tracking.
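The gesture-gated tracking logic above amounts to a small state machine: an "up→down" switch starts servo tracking, and a "down→up" switch stops it. The sketch below uses simplified string labels as stand-ins for the output of the feature-triangle gesture classifier; it is an illustration, not the paper's code.

```python
# Minimal sketch of the gesture-gated tracking state machine: a switch
# from "right click up" to "right click down" starts servo tracking,
# and the reverse switch stops it. Gesture labels are hypothetical
# stand-ins for the feature-triangle classifier's output.

class TrackingGate:
    def __init__(self):
        self.prev = "right_click_up"
        self.tracking = False

    def update(self, gesture):
        if self.prev == "right_click_up" and gesture == "right_click_down":
            self.tracking = True      # begin driving the PTZ camera
        elif self.prev == "right_click_down" and gesture == "right_click_up":
            self.tracking = False     # stop, wait for next click-down
        self.prev = gesture
        return self.tracking

gate = TrackingGate()
states = [gate.update(g) for g in
          ["right_click_up", "right_click_down",
           "right_click_down", "right_click_up"]]
print(states)
```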

Fig. 23
figure 23

The tracking results using a PTZ camera

Figure 24 shows the tracking result of the experiment. The blue line shows the horizontal and vertical coordinates of the tracked fingertips, while the green and red dashed lines represent the ideal position range. The camera resolution was 360×288, the ideal horizontal position range was 175~185, and the ideal vertical position range was 139~149. The result shows that during the whole tracking process, the hand and fingertip always remained in the middle of the camera's field of view.
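The centring criterion in Fig. 24 can be checked directly: a tracked point is "in range" when its coordinates fall inside the 175~185 horizontal and 139~149 vertical windows around the centre of the 360×288 image. The sample points below are hypothetical:

```python
# Sketch of the Fig. 24 evaluation: fraction of frames in which the
# tracked fingertip stayed inside the ideal centre window. The sample
# points are hypothetical illustration data.

H_RANGE = (175, 185)   # ideal horizontal range for a 360-pixel-wide image
V_RANGE = (139, 149)   # ideal vertical range for a 288-pixel-high image

def in_centre(point):
    x, y = point
    return H_RANGE[0] <= x <= H_RANGE[1] and V_RANGE[0] <= y <= V_RANGE[1]

tracked = [(180, 144), (178, 150), (184, 140)]
ratio = sum(in_centre(p) for p in tracked) / len(tracked)
print(ratio)
```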

Fig. 24
figure 24

The horizontal and vertical coordinates of the tracked fingertip

Figure 25 shows the algorithm time. In most frames the time was between 60 ms and 80 ms, while in some frames it was considerably longer than the average, since the time consumed by fingertip detection using the cubic Bezier curve depended on the hand region segmentation: the more points in the hand contour, the more time consumed.

Fig. 25
figure 25

The Euclidean distance between the tracked fingertip and the image center (left); the algorithm time (right)

4 Conclusion

This paper presented a real-time dynamic gesture recognition and hand tracking scheme using a PTZ camera. The aim was a robust scheme that stably recognized simple hand gestures and tracked the hand with a PTZ camera so as to keep the fingertip always in the middle of the camera's field of view. First, a new approach was proposed to detect the fingertip by estimating the curvature of hand contour points based on Monte Carlo sampling Bezier curve fitting, which is much more robust to noise than existing curvature-based methods. Second, a feature triangle analysis method was used to recognize simple dynamic gestures in real time. Third, the proposed fingertip detection and feature triangle analysis were applied to a self-made PTZ camera to realize servo tracking of the target fingertip whenever the gesture “right click down” was detected. The experimental results showed that the proposed approach successfully recognized the dynamic gestures, located the fingertip positions precisely, and realized follow-up servo tracking with the PTZ camera in real time.

Moreover, the curvature estimation approach using cubic Bezier curve fitting based on Monte Carlo sampling can easily be extended to other areas (e.g. industrial visual inspection) where curvature needs to be estimated accurately and quickly.