
1 Introduction

Sign Languages (SL) are visuo-gestural languages used mainly by the deaf community. Very few linguistic studies have been produced to explain and formalize their rules and grammar. The first contemporary linguist to study SL was Stokoe [17], who described the language in terms of phonemes (or cheremes) and built a written transcription for it. This work laid the groundwork and paved the way for deeper research on SL. Today, linguists collect and annotate videos of signers in natural contexts in order to extract knowledge from them. Currently, most SL videos are annotated manually using software such as ELAN [21] or ANVIL [10]. However, this process is very time-consuming, and the produced annotations are usually not reproducible since they depend on the subjectivity and experience of the annotator. An automatic annotation system could accelerate the work and enhance the reproducibility of the results.

Cues about the hands (shape and motion), facial expressions, gaze orientation and mouthing are all worth annotating. When talking about SL, most non-signers tend to think first of the hands. In fact, two communication channels exist: Manual Components (MC), consisting of the hand shapes, orientations and motions, and Non-Manual Components (NMC), consisting of facial features and body pose. Cuxac’s model [5] presents two ways of signifying, using a combination of 4 different MC and 4 NMC:

  1. “saying and showing”, with an illustrative intent, which consists of Highly Iconic Structures (HIS) that include Transfers in Sizes and Shapes (TSS) of objects, in Situations (ST) and in Persons (PT);

  2. “saying without showing”, which consists of Lexical Signs (LS), i.e. predefined signs from a dictionary, and pointing. More than 65% of the signs are lexical [8].

Thus, distinguishing these two classes, LS and HIS, would be a first step before applying dedicated processing to each of them. To this end, this paper proposes to determine the most relevant body and face features for detecting LS in SL discourse, and then illustrates this result by testing a classification method. The experiments are carried out on a French Sign Language (LSF) video dataset consisting of standard RGB videos; no specific sensor or wearable device is required, which enlarges the possible use cases. This pre-annotation is intended to facilitate the work of linguists and can be useful for constituting annotated data for further deep learning strategies.

The remainder of the paper is structured as follows. The next section discusses previous work on the annotation of SL videos. Section 3 describes the dataset studied in this paper. Sections 4 and 5 then describe the features and the classification method, respectively. Finally, Sect. 6 presents the experiments and discusses the results.

2 Related Work

In the literature, few papers have explored the automatic annotation of SL. The majority of the work on SL annotation studies American Sign Language (ASL), as it is the richest in terms of databases. Studies conducted on ASL are not necessarily suitable or applicable to LSF, since SL are not universal languages: each country has its own SL and its own grammar.

The first attempts at SL recognition were conducted on isolated signs. The general idea was to extract features from images in order to identify signs using a classifier such as an SVM [7, 16], neural networks [9], HMM [1] or KNN [15]. These works focused mainly on MC features, as hands were believed to carry the main information in SL discourse. Since then, many image processing and object segmentation techniques have been developed [20], and with the advances in machine learning, systems are now capable of estimating and tracking the face [18] and body [19] in real time with a high success rate using only 2D image features. Most works on SL recognition focus on specific datasets, recorded in controlled environments (uniform background, signer with dark clothes) and dealing with a specific topic, such as weather [11]. However, the real challenge in SL recognition remains the identification of dynamic signs, i.e. signs in real SL discourse, and most importantly independently of the signer [12]. Such work requires a huge annotated dataset, which is not available for LSF.

Concerning automatic annotation, most proposals annotate segments by describing facial and body events such as mouthing, gaze, occlusions, hand placements, handshapes and movements [14]. Few of them go further and exploit these events by combining MC and NMC to add a second-level annotation such as LS and HIS. In fact, [6] succeeded in annotating pointing in LSF videos by combining MC and NMC. In [13], the MC and NMC are tracked in order to categorize LS; however, an actual annotation of LS was lacking, and the tracking of NMC was done on controlled videos of the head. In our work, we test several combinations of MC and NMC to determine which components are the most effective for classifying LS.

3 Data

The dataset is a portion of the MOCAP dataset, a collection of RGB videos in LSF produced in our lab for other purposesFootnote 1. The videos show the signer from the hips up, facing the camera, and are standard 2D videos (\(720\times 540\) pixels, 25 FPS). We used 49 videos featuring 4 different signers, with randomly picked combinations for the learning and test sets. The length of the videos varies from 15 to 34 s (24 s on average, 19.63 min in total). In the videos, the signers were asked to describe what they see in an image (Fig. 1). The given images represented 25 different scenes (Fig. 2), such as a living room, a forest, a wine store, a library, a city, a monument or a construction site. The images were chosen to elicit a variety of LS and HIS. All the videos were annotated manually by one expert; the annotations include gaze, LS and HIS. In total, 1011 signs were annotated, of which 709 were LS and 304 were HIS.

Fig. 1. Sequence of the lexical sign “Salon”

Fig. 2. Examples of scenes to be described by the signers

4 Features Extraction

So far, linguists have not established a unified way of annotating, nor a predefined list of MC and NMC to track. In the literature, most papers are interested in the handshapes, their placements, motions, directions and the symmetry between the hands as MC, and in mouthing, mouth gestures, gaze and eyebrows as NMC.

To extract the features, we use OpenPose [4], a recent real-time pose estimation library for the face and body, which provides the coordinates of keypoints (body joints and face elements). We processed these coordinates to derive the higher-level features described hereafter.

Mouthing. Following [3], which shows that mouth features are important indicators of LS, our first step consisted in tracking mouthing. Other MC and NMC features were then successively added to see how the classification improves. OpenPose provides the coordinates of 20 points that define the outline of the lips (Fig. 3(a)).

Fig. 3. (a) Facial keypoints of OpenPose. (b) Relative movements of the hands. (c) Placement of signs.

We assume that a mouthing is detected whenever the signer opens the mouth to pronounce a vowel, which is not the case for mouth gestures. To detect the opening of the mouth, we compute the isoperimetric ratio (or circularity) of the interior of the lips using the formula \( \text {IR} = \frac{4{\pi }a}{p^{2}} \), where a is the area and p the perimeter. The higher the ratio, the more the mouth is open, which is the sign of a mouthing.
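As an illustration, here is a minimal sketch of this computation, assuming the lip contour is available as an (N, 2) array of ordered keypoint coordinates (the function name and data layout are ours, not OpenPose's):

```python
import numpy as np

def isoperimetric_ratio(lip_points):
    """Circularity of the closed lip contour: IR = 4*pi*area / perimeter**2.

    lip_points: (N, 2) array of lip keypoints ordered along the contour.
    A higher value means a rounder, more open mouth.
    """
    pts = np.asarray(lip_points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Polygon area via the shoelace formula (contour assumed non-self-intersecting).
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    # Perimeter: sum of edge lengths, closing the contour back to the first point.
    edges = np.diff(np.vstack([pts, pts[:1]]), axis=0)
    perimeter = np.linalg.norm(edges, axis=1).sum()
    return 0.0 if perimeter == 0 else 4.0 * np.pi * area / perimeter ** 2
```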

However, the mouth is often occluded when signs are formed in front of the face. To handle this problem, a temporal analysis window of 5 frames is used, in which the last relevant IR value, i.e. \(\mathrm{{IR}} \le 1\), is kept. An occlusion is detected when the distance between the hands and the mouth falls below a threshold (80 pixels in our dataset) and the lip coordinates are null.
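A hedged sketch of this occlusion handling, assuming per-frame records with a precomputed hand-to-mouth distance and a flag for missing lip keypoints (the field names and data layout are assumptions):

```python
def smooth_mouthing_ir(frames, occlusion_dist=80.0, window=5):
    """Keep the last relevant IR value (IR <= 1) when the mouth is occluded.

    frames: list of per-frame dicts with keys 'ir' (float or None),
    'wrist_mouth_dist' (pixels) and 'lips_missing' (bool); names are assumptions.
    """
    smoothed = []
    for t, f in enumerate(frames):
        occluded = f['lips_missing'] and f['wrist_mouth_dist'] < occlusion_dist
        if not occluded and f['ir'] is not None and f['ir'] <= 1.0:
            smoothed.append(f['ir'])
        else:
            # Look back within the 5-frame analysis window for the last relevant IR.
            recent = [g['ir'] for g in frames[max(0, t - window + 1):t]
                      if g['ir'] is not None and g['ir'] <= 1.0]
            smoothed.append(recent[-1] if recent else 0.0)
    return smoothed
```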

Gaze/Head Direction. Gaze plays an important role in HIS, when the signer places objects in the signing space in front of him and wants to draw the partner’s attention to something in that space.

Our facial features are detected using OpenFace [2]. In theory, gaze could be tracked with this model; however, because of the low resolution of the images under consideration, we had to rely on the head direction only (which is generally close to the gaze direction). We define the head direction as a ratio, where 0 corresponds to the head in a central position, and negative and positive values stand for left and right, respectively.
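One possible way to obtain such a ratio, assuming a head yaw angle has already been extracted from OpenFace's per-frame output (the saturation bound and the mapping are our own illustrative choices, not values from the paper):

```python
import numpy as np

def head_direction_ratio(yaw_rad, max_yaw_rad=np.pi / 4):
    """Map a head yaw angle (radians) to a ratio in [-1, 1]:
    0 = head facing the camera, negative = left, positive = right.
    The saturation bound `max_yaw_rad` is an assumed value, not from the paper.
    """
    return float(np.clip(yaw_rad / max_yaw_rad, -1.0, 1.0))
```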

Bi-manual Motion. During HIS, the signer can draw objects in the signing space, generally with both hands moving in a symmetrical or opposite way. OpenPose gives the coordinates of both wrists and elbows with high precision. From these coordinates, we derive the velocity and direction of the hand movements to create a motion characteristic vector for each arm. The correlation between the two arm vectors gives information about the relative movement of the arms: symmetrical (similar velocity and direction) or opposite (similar velocity and opposite directions).
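A sketch of one way to derive this relative-motion label from the wrist trajectories; the thresholds are illustrative and not taken from the paper:

```python
import numpy as np

def bimanual_motion(left_wrist, right_wrist):
    """Label the relative motion of the two hands over a short frame sequence.

    left_wrist, right_wrist: (T, 2) arrays of wrist coordinates over T frames.
    Returns 'symmetrical', 'opposite' or 'other'; thresholds are illustrative.
    """
    vl = np.diff(np.asarray(left_wrist, float), axis=0)   # left-wrist displacements
    vr = np.diff(np.asarray(right_wrist, float), axis=0)  # right-wrist displacements
    speed_l = np.linalg.norm(vl, axis=1)
    speed_r = np.linalg.norm(vr, axis=1)
    if speed_l.mean() == 0 or speed_r.mean() == 0:
        return 'other'
    # Mean cosine similarity between the per-frame displacement directions.
    cos_dir = ((vl * vr).sum(axis=1) / (speed_l * speed_r + 1e-9)).mean()
    similar_speed = abs(speed_l.mean() - speed_r.mean()) < 0.2 * max(speed_l.mean(), speed_r.mean())
    if similar_speed and cos_dir > 0.8:
        return 'symmetrical'   # similar velocity and direction
    if similar_speed and cos_dir < -0.8:
        return 'opposite'      # similar velocity, opposite directions
    return 'other'
```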

Signing Space. LS, which are generally known by the interlocutor, are mostly made in front of the signer. Contrary to HIS (transfers), they require less placement of objects in the signing space (left and right). The abscissas of the neck and wrists, found in each frame by OpenPose (Fig. 3(b) and (c)), are used to evaluate this location: we simply test whether the abscissa of the neck lies between the abscissas of the two wrists. If it does, the sign is centered; otherwise it is either to the left or to the right.
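The placement test can be written directly from the three abscissas; a minimal sketch (the left/right labels follow image coordinates and are a naming convention):

```python
def sign_placement(neck_x, left_wrist_x, right_wrist_x):
    """Locate a sign in the signing space from the abscissas of neck and wrists.

    Returns 'center' when the neck abscissa lies between the two wrist abscissas,
    otherwise 'left' or 'right' (image coordinates; the naming is a convention).
    """
    lo, hi = sorted((left_wrist_x, right_wrist_x))
    if lo <= neck_x <= hi:
        return 'center'
    return 'left' if hi < neck_x else 'right'
```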

5 Lexical Classification

Since we do not have a huge learning dataset, a simple classifier has been chosen instead of convolutional neural networks. The first step in building our classifier was to find the decision rule. Using the features extracted from the learning data, combined with the annotations of LS and HIS, we drew the distribution of each parameter between the two types of signs along the frames of the videos. Since the values of our features are continuous, we assume that their distributions are normal with mean \( \mu _k \) and variance \( \sigma _k^{2} \). Figure 4 shows how the feature values (here, IR) are distributed between LS and HIS for a specific learning set. These functions represent the probability distribution of each feature \( x \) given a sign type \( C \); \( P(x_i\mid C) \) is computed by plugging \( x_i \) into the equation of a normal distribution parameterized by \( \mu _k \) and \( \sigma _k^{2} \):

Fig. 4. Distribution of the isoperimetric ratio IR between the two types of signs (LS and HIS).

$$\begin{aligned} P(x = x_i\mid C) = \frac{1}{\sqrt{2\pi \sigma _k^{2}}}e^{-\frac{(x_i - \mu _k)^{2}}{2\sigma _k^{2}}} \end{aligned}$$
(1)

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is the most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding Bayes classifier assigns a class label \({\displaystyle {\hat{y}}=C_{k}} \) for some k as follows:

$$\begin{aligned} \hat{y} = \underset{k\in \{1,\dots ,K\}}{\text {argmax}}\ P(C_{k})\prod _{i=1}^{n}P(x_{i}\mid C_{k}) \end{aligned}$$
(2)

under the assumption that all the features are independent.

After creating our model using the learning dataset, for each new frame in the testing dataset we calculate:

$$\begin{aligned} P(Lexical\mid F_{1},F_{2},F_{3},F_{4}) = P(Lexical) \prod _{i=1}^{4}P(F_{i}\mid Lexical) \end{aligned}$$
(3)
$$\begin{aligned} P(HIS\mid F_{1},F_{2},F_{3},F_{4}) = P(HIS) \prod _{i=1}^{4}P(F_{i}\mid HIS) \end{aligned}$$
(4)

where \(F_1\) is the mouthing, \(F_2\) the head direction, \(F_3\) the bi-manual motion and \(F_4\) the sign placement. We then compare (3) and (4): if the result of (3) is greater than the result of (4), the new frame is classified as part of an LS; otherwise it is part of an HIS.
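A minimal sketch of this Gaussian naive Bayes decision rule, estimating per-class means, variances and priors on the learning set and applying Eqs. (1)–(4) in log space for numerical stability (the class and method names are ours):

```python
import numpy as np

class FrameNaiveBayes:
    """Two-class Gaussian naive Bayes (LS vs HIS) over the four frame features."""

    def fit(self, X, y):
        # X: (N, 4) feature matrix; y: (N,) array of labels, e.g. 'LS' / 'HIS'.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = {c: float(np.mean(y == c)) for c in self.classes_}
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.vars_ = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes_}
        return self

    def log_posterior(self, x, c):
        # log P(C) + sum_i log N(x_i; mu_k, sigma_k^2): Eqs. (1), (3), (4) in log space.
        mu, var = self.means_[c], self.vars_[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(self.priors_[c]) + log_lik.sum()

    def predict_frame(self, x):
        # MAP decision rule of Eq. (2): pick the class with the larger posterior.
        x = np.asarray(x, dtype=float)
        return max(self.classes_, key=lambda c: self.log_posterior(x, c))
```

In practice, an off-the-shelf implementation such as scikit-learn's GaussianNB should yield the same frame-level decisions.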

6 Experiments and Results

6.1 Preliminary Analysis

The manual annotations provided with MOCAP were first useful to establish some statistics about the signs. We were mostly interested in the sign frequencies and lengths. We found that 69.99% of the signs in the database are lexical while 30% are HIS, and that the typical length of a sign is between 3 and 10 frames, as shown by the distribution of sign lengths in Fig. 5.

Fig. 5. Distribution of sign lengths in the MOCAP dataset.

6.2 Evaluation

The proposed method is applied to the dataset detailed in Sect. 3.

The classification results for LS are compared to the manual annotations. Figure 6 shows, for one of the videos and for each frame, an example of the classification result (in red) compared to the annotation (in blue). Because of the subjectivity of the annotations, an annotated LS is considered correctly detected when 3 consecutive frames (the smallest length of an LS) classified as lexical fall within the range of the annotated sign.
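This matching criterion can be sketched as follows, assuming per-frame binary predictions and annotated sign boundaries given as frame indices (names are illustrative):

```python
def lexical_sign_detected(pred_is_lexical, sign_start, sign_end, min_run=3):
    """True if at least `min_run` consecutive frames classified as lexical
    fall inside the annotated interval [sign_start, sign_end] (inclusive)."""
    run = 0
    for t in range(sign_start, sign_end + 1):
        run = run + 1 if pred_is_lexical[t] else 0
        if run >= min_run:
            return True
    return False
```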

Fig. 6. Counting of false/true positives and false/true negatives (Color figure online)

For the evaluation metrics, we counted the true positives (TP) among the detected lexical signs, as well as the false positives (FP), true negatives (TN) and false negatives (FN) in each video of the test dataset. We then compute the TP and TN rates (TPR and TNR), the positive predictive value (PPV) and the F1-score:

$$\begin{aligned} TPR=\frac{TP}{TP+FN} \quad TNR=\frac{TN}{TN+FP} \quad PPV=\frac{TP}{TP+FP} \quad F_{1}=2\,\frac{PPV \cdot TPR}{PPV + TPR} \end{aligned}$$
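For reference, a small helper computing these scores from the per-video counts (the zero-division guards are our addition):

```python
def scores(tp, fp, tn, fn):
    """TPR, TNR, PPV and F1-score from per-video counts (zero divisions guarded)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return tpr, tnr, ppv, f1
```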

6.3 Classification Results

First, the results of our method are evaluated for each signer individually and then combined, to check whether the classification is independent of the signer. For each experiment, Mouthing (M) is tested alone, and the other features are successively added: Head direction (H), Bi-manual motion (B) and Sign placement (S).

Intra-signer Study. For each signer, the videos are divided into 3 subsets \(L_1\), \(L_2\) and \(L_3\). Two of them, \((L_i, L_j) \in \{(L_1, L_2), (L_1, L_3), (L_2, L_3)\}\), are used for learning and the remaining one for testing. A cross-validation is performed by collecting the results of each experiment. The averages and standard deviations of the results are shown in Table 1.

Table 1. Evaluation of the results for intra-signer classification using the features Mouthing (M), Head direction (H), Bi-manual motion (B) and Sign placement (S). The values shown are averages over the results obtained for each signer separately

Inter-signer Study. Here, the videos are divided into 4 subsets \(L_1\), \(L_2\), \(L_3\) and \(L_4\), each subset containing all the videos of a single signer. Again, we tried all combinations with three subsets for learning and one for testing; the results are shown in Table 2.

Table 2. Evaluation of the results for inter-signer classification using the features Mouthing (M), Head direction (H), Bi-manual motion (B) and Sign placement (S). The values shown are averages over the results from all signers combined

Analysing Tables 1 and 2, mouthing and head orientation appear to be the most relevant features for distinguishing LS from HIS, while bi-manual motion and sign placement do not seem to add any relevant information for this task. The similarity between the results obtained in the intra-signer and inter-signer experiments confirms the generality of our approach. The performance seems low compared to more standard gesture recognition applications. This is explained by the huge variety of the motions made for signs, the imperfection and subjectivity of the annotations, and the error margin of OpenPose and OpenFace during feature extraction, since we are working on low-resolution videos. However, our application is a semi-automatic annotation of SL: it will be of great help for linguists, who will only have to confirm or reject the proposed classification.

6.4 Impact of the Segmentation

As mentioned previously, the manual annotations of the videos are both subjective and imprecise: each annotator has his own rules to define the beginning and the end of each sign. We wanted to find out how many of the false positive LS classifications actually refer to a neighbouring sign in the annotation, in order to test the hypothesis that such a sign was counted as a false detection because of the subjectivity of the annotation and a delay between the annotation and the detection (Fig. 7). We therefore enlarged each detected sign by 3 frames (the smallest length of an LS) at its beginning and end and recomputed the evaluation results. The new values in Table 3 show an improvement of the classification rate. Even if the improvement is modest, it makes us more curious about the importance of the segmentation of the detected signs.
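A minimal sketch of this enlargement step, assuming detections are stored as inclusive frame intervals (the helper name and parameters are illustrative):

```python
def enlarge_detections(detections, margin=3, last_frame=None):
    """Extend each detected LS interval by `margin` frames on both sides
    (3 frames = the smallest LS length) before re-matching with the annotations.

    detections: list of (start, end) inclusive frame intervals.
    """
    enlarged = []
    for start, end in detections:
        new_end = end + margin if last_frame is None else min(last_frame, end + margin)
        enlarged.append((max(0, start - margin), new_end))
    return enlarged
```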

Fig. 7. Before enlarging the detected signs (upper image) and after (lower image)

Table 3. Evaluation of the results for inter- and intra-signer classification after enlarging the detected signs

7 Conclusion

This paper has proposed a tool that will help linguists pre-annotate Sign Language (SL) videos, in order to alleviate the annotation burden. This first step distinguishes the temporal segments that correspond to lexical signs from other segments, such as highly iconic ones. The study of the features showed that mouthing and head orientation are the most discriminant features for this task. This work has several perspectives. First, the impact of other features will be tested, and other classifiers such as SVM will be used in order to compare the results and observe the impact of the classification system. Then, once a lexical sign is detected in the video, the temporal segmentation around this detection will have to be refined. After segmentation, it will be possible to run a sign recognition algorithm on the resulting LS segments. It will also be interesting to test our approach on other SL, in order to assess its universality.