1 Introduction

Sign language (SL) is the primary means of communication for most deaf and hard-of-hearing people. As of 2013, approximately 1.1 billion people worldwide have varying degrees of hearing loss [1], 124 million of whom have moderate to severe loss [2]. Contrary to the common view that sign language is merely a direct gestural description of objects or a translation of the spoken (written) language, sign language is a natural language with its own linguistics, words, sentences, and grammar. Sign languages were created and developed in deaf communities relatively independently, although they are strongly influenced by the cultures and languages of the surrounding hearing societies. Sign language is culturally and geographically diverse, much like all other natural languages. Sign language was not widely recognized as a natural language until Stokoe's publication on sign language structure in the 1960s [3], the first study of sign language with linguistic methodology. Stokoe presented persuasive evidence that American Sign Language (ASL) is indeed a natural language with grammar and vocabulary independent of English.

Sign language recognition (SLR), which can improve communication between the hearing and the deaf, began to emerge as a research topic and received increasing attention in the 1990s. Research on sign language initially followed the same technical lines as speech recognition, and numerous effective speech recognition methods were transferred directly to SLR. However, the asynchronous integration of multimodal articulators is a difficult and frequently ignored issue in SLR: as shown in Stokoe's [3] sign language model, five individual components are present, i.e. movement, location, orientation, handshape, and facial expression.

The hidden Markov model (HMM) [4] is the most classical and well-known method for SLR, as it can model the transitions between latent states of sequence data. Various variants of the HMM, such as the maximum entropy Markov model (MEMM) [5], conditional random fields (CRF) [6], product-HMM [7], and hierarchical HMM [8], have also been widely used in SLR studies. However, because there is still no unified definition of sign language syllables or primitives, HMM methods are generally only applicable to simple, small-vocabulary, isolated SLR. In recent years, SLR research based on deep neural networks (e.g. GRU [9], TGCN [10], I3D [11]) has become increasingly popular. The enormous number of parameters in deep neural networks allows them to accommodate the intricate spatio-temporal structures of sign language, although a significant amount of training data is usually necessary to learn a practical, powerful statistical model.

However, labelled data is a scarce resource for sign language due to the enormous cost of transcribing these unwritten languages. The majority of publicly accessible sign language databases to date have been created in laboratory settings: a small corpus of signs is selected or designed first, and each sign is then performed repeatedly by a number of signers (native or non-native). Some databases use data-gloves to capture every detail of the arms and fingers [12,13,14], while others use RGB or RGB-depth cameras to record sign language videos [11, 15]. Signers are typically instructed to wear coloured gloves to improve the robustness of palm detection during the creation of vision-based sign language databases. These laboratory-environment databases usually have restricted size and variation due to the high cost of performance and annotation. Meanwhile, more and more videos of TV news, press conferences, and parliamentary speeches are being supplemented with real-time sign language interpretation, and there are many sign language-interpreted videos on the Internet (e.g. video websites, online deaf communities). Unfortunately, the supervisory information of these sign language-interpreted videos is weak and noisy due to the lexical differences, time misalignment, and grammatical inconsistencies between sign language and spoken language.

Data mining is a process for knowledge discovery that can draw out latent patterns, relationships, and correlations from massive data. Data mining methods have been widely used in many fields, from medicine to social life. For example, influenza epidemics can be detected early by analysing large numbers of Google search queries in a population [16], and the clinical graph signs of Kawasaki disease can be detected by a deep convolutional neural network (CNN) [17]. Characteristic and necessary motions can be extracted using motif-guided attention networks [18]. If unlabelled signs can likewise be mined from the large body of existing sign language-interpreted videos, it will be a promising solution to the scarcity of labelled sign language resources.

The goal of this work is to propose a novel framework to automatically learn a large-scale sign language database from these sign language-interpreted videos. We achieve this by exploiting the supervisory information available in the subtitles (or audio) of the videos through learning shapelets, i.e. discriminative subsequences of time series that best predict the target variable. The main contributions of this paper are:

  • We provide a framework that can automatically create a sign language database from massive sign language-interpreted videos by integrating pose extraction, subtitle parsing, shapelet mining, and sample augmentation.

  • We introduce the tricks of iterative updating and matrix operations to greatly speed up shapelet searching.

  • We propose a strategy that combines shapelet searching and shapelet learning to benefit both speed and accuracy.

  • We demonstrate that adaptive sample augmentation can greatly improve the database’s size, variety, and balance.

The rest of the paper is organized as follows: Section 2 summarizes the related work on SLR, sign language databases, multiple instance learning, and sample augmentation. Section 3 gives definitions of the main terminology and provides an overall introduction to the proposed framework. Section 4 describes in detail how to process the original sign language videos (including subtitles) and how to generate training samples for a given word based on the supervisory information. Section 5 demonstrates how to speed up target sign extraction for two available shapelet discovering algorithms. The construction of a concrete database and experiments on recall rate and classification in Sect. 6 illustrate the usefulness and efficiency of the proposed framework. The experimental results and the impacts of the parameters are analysed and discussed in Sect. 7. Finally, Sect. 8 concludes the whole work and provides an outlook for future work.

2 Related works

Our work relates to several themes in the literature, such as sign language recognition, sign language databases, multiple instance learning, and sample augmentation.

2.1 Sign language recognition

The study of automatic SLR has developed for about 30 years, since the 1990s. Since non-motion features (e.g. facial expression [19, 20]) are difficult to identify and quantify, most research focuses only on motion features. Shape-based SLR studies focus on the spatial features of hands and poses through designed descriptors of trajectories and shapes [21,22,23,24]. HMMs [4,5,6, 25] were the dominant method for modelling the state transitions of sign language sequences until the advent of deep learning. Deep neural networks such as CNNs [26, 27], RNNs [9, 28,29,30], TGCN [10], and Transformers [31, 32] have proven to be effective architectures for modelling the complex spatio-temporal structures of sign language.

Meanwhile, various body models [33,34,35,36] have been proposed to extract human skeletons from images. In particular, the sequence convolution network in [35, 36] maps images into confidence maps of skeleton keypoints. Additionally, its publicly available trained model can be used without further training to perform real-world pose prediction, making video-based SLR more convenient and efficient. In addition, I3D [11], a model that convolves videos along both the spatial and temporal dimensions, has shown its superiority in video-based SLR [10, 37,38,39]. In this work, three state-of-the-art models (GRU [9], TGCN [10], and I3D [11]) were adopted to evaluate our final database.

2.2 Sign language databases

A summary and review of some earlier sign language databases can be found in [40]. Here, we introduce a few of the most representative sign language databases currently available. Purdue RVL-SLLL [41] contains 104 words and 1834 samples, performed by 14 native signers in a laboratory environment under controlled lighting. RWTH-Boston [42] contains three subsets of 50, 104, and 400 signs, performed by two to five native signers as isolated words and continuous sentences. DeviSign [15] is a large-scale word-level sign language database containing up to 2000 Chinese signs and 24,000 RGB-depth recordings performed by eight non-native signers in a laboratory environment (controlled background). These databases, regardless of size, were built in laboratory environments, which is highly expensive and mostly results in limited size and variation.

Both MSASL [37] and WLASL [10] are large-scale word-level sign language databases that collect isolated signs from the Internet. However, there are significant differences between isolated signs and the co-articulated signs of “naturally” continuous signing. Also, there are far fewer isolated sign language videos on the Internet than continuous ones. BSL-1k [38] is a sign language database that contains a vocabulary of 1064 signs extracted from more than 1000 h of continuous BBC sign language TV programmes. Its sign extraction relies only on lip movements, without regard for body movements. Such a database construction framework, which relies exclusively on lip movement models, has low data utilization and is difficult to transfer directly to the construction of databases for other sign languages.

2.3 Multiple instance learning

Multiple instance learning (MIL) is a weakly supervised problem in which the labels of individual instances are inferred from labels given for bags of instances [43, 44]. Paper [45] shows that localizing the target signs for a given word using subtitles is essentially a two-class MIL problem, and it proposes a sliding window classifier to find the clip with the highest score as the target sign. Paper [46] makes two improvements over [45]: first, it shrinks the search space by exploiting the co-occurrence of lip movements, and second, it trains a discriminative model of the target signs using the MIL-SVM (support vector machine) method [47]. The Apriori mining algorithm was adopted in [48] to infer the rules between signs and words from positive and negative discrete-encoded sign language video clips. In [49], the correspondence between isolated signs and words was learned implicitly by modelling the translation process between subtitle sentences and their video clips with a transformer model. In the latest study [39], the authors used a purpose-designed embedding architecture and the InfoNCE loss [50] to train a feature model for the target signs. The architecture comprises an I3D spatio-temporal trunk network [11] attached to a MIL trunk consisting of three linear layers separated by leaky ReLU activations and a skip connection.

The concept of the shapelet, proposed in [51], is actually a solution to the MIL problem discussed above: the shapelet and its nearest subsequence in each positive sample are the desired target signs. In theory, finding a shapelet requires searching all possible subsequences [51], similar to the sliding window classifier in [45]. However, many tricks, such as early abandoning, reuse of computation, parallel computing, and candidate filtering, can be used to speed up the search [52,53,54,55,56]. Furthermore, some works treat all of the shapelet values as unknown parameters and use gradient descent methods to learn them [57, 58]. Other studies add constraints during the learning process to give the shapelet more of the expected characteristics [59, 60].

In this work, we find the target signs for a given word from the perspective of solving for a shapelet with weakly supervised information.

2.4 Sample augmentation

Data augmentation, which generates new samples by slightly varying the available samples, is very common in model training. The sample augmentation of signs is a special case of data augmentation that takes all instances in the data source similar to a given sample as augmented samples [61]. These augmented samples, obtained from a real data source, have more reasonable, rich, and realistic variations. The experiments in [61] show that augmented samples obtained from arbitrarily collected unlabelled sign language videos can give a given sign sample one-shot learning ability. Papers [38, 39] also show that although relaxing the sign spotting criteria introduces more noisy samples, the additional training samples bring better recognition results. The experiments in our paper are consistent with this.

3 Notation and overview

3.1 Definition and notations

To make the description of our work clear and distinct, some key terms are defined as follows:

Definition 1

Time series. A time series T is a list of data points ordered along time, \(T=t_1, t_2, \dots , t_{N}\), where each data point is a feature vector.

Definition 2

Subsequence. A subsequence is a slice of consecutive data points cut from a time series T. For example, \(S=T_{k:k+m}=t_k, t_{k+1}, \dots , t_{k+m-1}\) is a subsequence of length m cut at the \(k\)-th point.

Definition 3

Dis. Dis is defined as the squared Euclidean distance function between two time series of the same length.

$$\begin{aligned} Dis(T, R) = \sum _{i=1}^{N} (t_i-r_i)^2 \end{aligned}$$
(1)

where N is the length of two time series.

Definition 4

subDis. subDis is defined as a distance function between two time series of different lengths. The shorter time series is the query sequence Q, and the longer one is the search sequence T. subDis(Q, T) returns the minimum of the squared Euclidean distances between Q and all \(\vert Q\vert\)-length subsequences of T. That is:

$$\begin{aligned}&subDis(Q,T) = \min Dis(Q,S), \quad \forall S \in {\mathbb {S}}_T^{\vert Q \vert }\end{aligned}$$
(2)
$$\begin{aligned}&{\mathbb {S}}_T^{\vert Q\vert } = \left\{ T_{k:k+\vert Q\vert } \ \vert \ 1\le k \le \vert T\vert -\vert Q\vert +1\right\}. \end{aligned}$$
(3)

Definition 5

word \(\mathbf { \& }\) sign. To avoid terminological confusion between the two natural languages, we specify that the term “word” refers to an “isolated word” of the written (spoken) language, and the term “sign” refers to an “isolated word” of the sign language.

Definition 6

Subtitle frame. The minimum unit element of a subtitle file. It consists of three basic components: a short text, a begin timestamp, and an end timestamp, and represents the interpreted information of the video clip from the begin timestamp to the end timestamp.

Definition 7

Candidate time window. A potential time window that contains the target “sign” for a given “word”. Because of the rough timing of subtitles and the intricate link between the two natural languages, it is difficult to pinpoint the exact location of the target sign.

3.2 Overview of framework

The proposed framework’s flowchart is shown in Fig. 1. The framework receives files containing sign language videos and their subtitles as input, and then a sign language database is constructed as output. There are four main steps in the framework. We will describe each of these steps below.

Step one is sign motion feature extraction. For an input video, the region around the signer is cropped first, then the keypoints of his/her hands and upper body are detected and tracked, and finally the sign motion features are computed from the keypoint positions. Step two prepares the training data. A corpus of reasonable words is constructed from the subtitle files; then, for each given word from the corpus, equal numbers of positive and negative samples, determined by the designed candidate window, are chosen as training data. Step three learns the shapelet from the training data with the shapelet searching and shapelet Net methods. The shapelet is the subsequence that best classifies the training samples under subDis distances. Step four is sample augmentation. To increase the number of samples per word and the variation of the final database, all motion subsequences that are very similar to the learned shapelet are appended to the final database as augmented samples. Almost all parameters of the whole framework can be learned automatically.

Fig. 1
figure 1

The flowchart of the sign language database auto-construction framework

4 Automatic generation of training data

This section, covering steps one and two, provides the specifics of preprocessing: how to extract motion features, how to process subtitles, and how to generate training samples for a given word.

4.1 Sign language features designing

In our framework, the region of interest (ROI) of each video frame is first identified and cropped, and then the openposeFootnote 1 library is used to detect the keypoints of the signers. Figure 2 depicts the distribution of the keypoints detected by openpose. There are 60 keypoints in total, including 18 trunk keypoints and 42 hand keypoints (both hands). One of the most attractive advantages of the openpose library is its strong model transferability: the available trained model can be used directly in our work without retuning. Overall, openpose is stable for detecting body joints (head, limbs, and torso), but the detection of finger joints is usually poor under low resolution, blur, and occlusion.

Fig. 2
figure 2

The output keypoints of openpose library

For sign language, only the keypoints of the upper body and hands are considered. Due to disturbing factors such as occlusion, missed detections, viewpoint rotation, and background interference, the position coordinates of the extracted keypoints are first filtered to eliminate invalid values and outliers. In addition, the obtained keypoint coordinates are absolute values. To avoid interference from height, distance, position, viewpoint, etc., the coordinates are rescaled with the formula below: all coordinates are computed relative to the chest keypoint and then normalized by the vertical distance between the nose and chest keypoints.

$$\begin{aligned} \hat{\varvec{P}}_{i} = \frac{\varvec{P}_{i} - \varvec{P}_{1}}{\vert P_{1}^{y} - P_{0}^{y}\vert } \end{aligned}$$
(4)

where \(\varvec{P_i} = (P_i^x, P_i^y)\) is the 2D position coordinate of keypoint i, and keypoints \(\varvec{0}\) and \(\varvec{1}\) are the nose and chest keypoints, respectively.
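For illustration, a minimal numpy sketch of this normalization, assuming the openpose body layout with index 0 as the nose and index 1 as the chest (neck), might look as follows:

import numpy as np

def normalize_keypoints(P, nose=0, chest=1):
    # Sketch of Eq. (4): coordinates relative to the chest keypoint,
    # scaled by the vertical nose-chest distance.
    # P: (K, 2) array of (x, y) coordinates for one frame.
    scale = abs(P[chest, 1] - P[nose, 1])
    if scale == 0:  # guard against failed detections
        return np.zeros_like(P)
    return (P - P[chest]) / scale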

4.2 Subtitle processing

The supervised information of sign language-interpreted videos is available in three forms: stand-alone subtitle files (e.g. vtt, srt format files), embedded subtitles, and interpretation soundtracks. The latter two can be transformed into stand-alone subtitle files through optical character recognition (OCR) and speech recognition, respectively. This supervised information is both weak and noisy. It is weak since the temporal distance between sign and subtitle is unknown and the act of signing does not follow the text order. It is noisy because subtitles can be signed in different ways, and the occurrence of a subtitle word does not imply the presence of the corresponding sign [45].

In an ideal situation, each word in the subtitles corresponds to a video clip, and all words in the subtitles could be collected to construct a corpus. However, further purification is needed to make the corpus reasonable. First, stemming and lemmatization are used to remove the inflections (e.g. “s”, “es”, “ed”, “ing”) of each word in the corpus, yielding lemma forms. Second, low-frequency words are removed from the corpus. Third, words with unclear meanings are removed as stop words. Finally, to strengthen the one-to-one correspondence between words and signs, polysemous words (words with multiple meanings) are also removed.
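For illustration, a minimal sketch of this purification pipeline, assuming the nltk lemmatizer and stop-word list and a hypothetical hand-curated set of polysemous words (the paper does not specify its tooling), could be:

from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer  # requires the nltk wordnet corpus

def build_corpus(subtitle_words, min_freq=10, polysemous=frozenset()):
    # Lemmatize, then drop low-frequency words, stop words, and
    # (assumed pre-listed) polysemous words.
    lemmatizer = WordNetLemmatizer()
    stops = set(stopwords.words("english"))
    lemmas = [lemmatizer.lemmatize(w.lower()) for w in subtitle_words]
    counts = Counter(lemmas)
    return sorted(w for w, c in counts.items()
                  if c >= min_freq and w not in stops and w not in polysemous)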

4.3 Training samples generation

For a given word in the resulting corpus, all subtitle frames [see Definition 6] whose short text contains the given word are defined as positive frames, and the corresponding video clips are defined as positive samples. Conversely, negative samples are defined as video clips that do not overlap with the positive samples of the given word or its synonyms. In general, the candidate time window of a positive sample needs to be widened to ensure that the latent target sign is enclosed. The usual procedure is to extend the time window forward and backward by one subtitle frame, denoted as:

$$\begin{aligned} pos\_win = \left[ f_{t-1}^\mathrm{begin}, f_{t+1}^\mathrm{end}\right] \end{aligned}$$

where \(f_t\) represents the \(t\)-th subtitle frame, and the superscripts begin and end indicate the timestamps. However, for most real-time sign language interpretation videos, signing is delayed relative to speech. When the time span of the following subtitle frame is too short, the extended time window is not guaranteed to enclose the latent target sign.

In this work, an adaptive extension method is proposed to determine the candidate time window: the time window should contain the time span of the preceding, current, and next k subtitle frames, where k is the minimum positive integer that satisfies the following condition:

$$\begin{aligned} f_{t+k}^\mathrm{end}-f_t^\mathrm{end}\ge \beta \cdot \left( f_t^\mathrm{end}-f_t^\mathrm{begin}\right) \end{aligned}$$
(5)

The above formula states that the extension along increasing time must be at least \(\beta\) times the time span of the positive subtitle frame. In most cases, \(\beta\) should be larger than 1.
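A minimal sketch of this adaptive extension, with subtitle frames represented as (begin, end, text) tuples and \(\beta = 1.5\) as used later in Sect. 6.3, might be:

def candidate_window(frames, t, beta=1.5):
    # Sketch of Eq. (5): extend the window of positive frame t backward by
    # one frame and forward by k frames, where k is the smallest integer
    # whose added time span reaches beta times the span of frame t.
    begin = frames[max(t - 1, 0)][0]
    span = frames[t][1] - frames[t][0]
    k = 1
    while (t + k < len(frames) - 1
           and frames[t + k][1] - frames[t][1] < beta * span):
        k += 1
    return begin, frames[t + k][1]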

5 Automatic sign extraction

This section describes how to find the target signs for a given word from its positive and negative samples. By the definition of the samples, we expect the target sign to appear in the majority of positive samples and not to occur in negative samples. The task can therefore be naturally formulated as a MIL problem. MIL generalizes the pattern recognition problem by making a significantly weaker assumption about the labelling information. Formally, there are two types of bags: positive \(\varvec{B_p}\) and negative \(\varvec{B_n}\). Each bag has an indefinite number of instances, \(\varvec{B}=\{\varvec{x_{0}}, \varvec{x_1}, \dots \}\). When a bag is positive, at least one instance in the bag is positive; conversely, in a negative bag, all instances must be negative. The goal is to learn a binary classifier for the instances \(\varvec{x}\). In our task, a sample corresponds to a bag, a subsequence in a sample corresponds to an instance in a bag, and a target sign corresponds to a positive instance.

When the instance-level binary classifier is configured as a 1-NN classifier, training it essentially amounts to learning a short sequence that optimally separates the two classes of samples under the subDis distance. The problem of finding target signs has thus been transformed into the problem of shapelet discovery.

5.1 Shapelet searching

In theory, the shapelet can be any sequence shorter than the shortest training sample, so the search space can be infinite. For simplicity, we generally assume that the shapelet is a subsequence of the training samples, so that shapelet discovery becomes subsequence searching. For a given word, the training set is defined as:

$$\begin{aligned} {\mathbb {T}} = T_\mathrm{pos}^1, T_\mathrm{pos}^2,\dots , T_\mathrm{neg}^1, T_\mathrm{neg}^2 \dots \end{aligned}$$

where \(T_\mathrm{pos}^i\) represents the ith positive sample, and \(T_\mathrm{neg}^i\) the ith negative sample. Each sample is a motion feature sequence.

5.1.1 Candidate subsequences

For subsequence searching, the candidate subsequences can be generated using a sliding window strategy. Lines 2–4 of Algorithm 1 show the generation of all candidate subsequences with lengths between \(l_\mathrm{min}\) and \(l_\mathrm{max}\), where \({\mathbb {S}}_T^l\) is the set of all subsequences of sample T with length l, as defined in Eq. 3. The length range satisfies:

$$\begin{aligned} 0<l_\mathrm{min}\le l_\mathrm{max}\le \min (\vert T\vert ), \quad \forall T\in {\mathbb {T}} \end{aligned}$$
(6)

In fact, the search range of the shapelet is small because only the reasonable lengths of real signs are taken into account. Additionally, only subsequences of positive samples are searched.

5.1.2 Score of subsequence

The subDis distance between a sample \(T \in {\mathbb {T}}\) and a subsequence Q is calculated as \(d = subDis(Q,T)\). Then, the training set \({\mathbb {T}}\) can be split into two subsets with a distance threshold \(d_{\sigma }\): a sample T is added to subset \({\mathbb {T}}_1\) if its subDis value \(d \le d_\sigma\); otherwise, it is added to subset \({\mathbb {T}}_2\). Finally, a score function is created to assess the discriminating capacity of Q.

$$\begin{aligned} \text {Score}(Q, {\mathbb {T}})&= \max _{d_{\sigma }} \left( I({\mathbb {T}}) - I({\mathbb {T}}_1, {\mathbb {T}}_2)\right) \end{aligned}$$
(7)
$$\begin{aligned} \qquad&= \max _{d_{\sigma }} I({\mathbb {T}}) - \frac{\vert {\mathbb {T}}_1\vert }{\vert {\mathbb {T}}\vert }I({\mathbb {T}}_1) - \frac{\vert {\mathbb {T}}_2\vert }{\vert {\mathbb {T}}\vert }I({\mathbb {T}}_2)\end{aligned}$$
(8)
$$\begin{aligned} Q^*&= \mathop {\arg \max }_{Q} \text {Score}(Q,{\mathbb {T}}),\forall Q \in T, T\in {\mathbb {T}}_p. \end{aligned}$$
(9)

where \(I({\mathbb {T}}) = -\sum _c p(c)\log (p(c))\) is the information entropy function, and c is the label of each sample. In practice, all subDis distances can first be sorted by magnitude, and the optimal threshold \(d_{\sigma }\) can then be approximated by searching over the midpoints of adjacent distances, which allows fast calculation without affecting the score. Theoretically, Q attains the highest score when \({\mathbb {T}}_1\) and \({\mathbb {T}}_2\) are identical to \({\mathbb {T}}_p\) and \({\mathbb {T}}_n\), respectively. The subsequence \(Q^*\) with the highest score is then the shapelet of \({\mathbb {T}}\).
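A minimal sketch of this score computation, assuming 0/1 integer labels and thresholds taken between adjacent sorted distances, could be:

import numpy as np

def score(dists, labels):
    # Sketch of Eqs. (7)-(8): best information gain over all split points
    # of the sorted subDis distances.
    def entropy(y):
        if len(y) == 0:
            return 0.0
        p = np.bincount(y, minlength=2) / len(y)
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    order = np.argsort(dists)
    y = np.asarray(labels)[order]
    base, n = entropy(y), len(y)
    best = 0.0
    for i in range(1, n):  # split between the (i-1)-th and i-th distances
        gain = base - (i / n) * entropy(y[:i]) - ((n - i) / n) * entropy(y[i:])
        best = max(best, gain)
    return best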

5.1.3 Shapelet searching

In summary, the whole process of shapelet searching is shown in Algorithm 1. The inputs of this algorithm are the training set \({\mathbb {T}}\) of a given word and the possible length range \((l_\mathrm{min}, l_\mathrm{max})\) of the target sign. The length range is a hyper-parameter influenced by region, signer, word, mood, video fps (frames per second), etc. In this paper, the length range is set empirically to \((0.5*L_\mathrm{avg},\ 2*L_\mathrm{avg})\), where \(L_\mathrm{avg}\), the average time span per word, is calculated as:

$$\begin{aligned} L_\mathrm{avg} = \frac{\sum _i (f^\mathrm{end}_i - f^\mathrm{begin}_i)}{\sum _i \vert f^\mathrm{text}_i\vert } \end{aligned}$$
(10)

where \(f^\mathrm{begin}_i, f^\mathrm{end}_i, f^\mathrm{text}_i\) refer to the three components of the ith subtitle frame, and \(\vert f^\mathrm{text}_i\vert\) is the word count of the short text.

[Algorithm 1: shapelet searching; rendered as an image in the original]
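Since Algorithm 1 is rendered as an image in the original, the following is only a minimal sketch of the brute-force search it describes, reusing the score helper above and an assumed sub_dis helper implementing Definition 4 (e.g. the minimum of the SldDists sketch given below):

def shapelet_search(pos, neg, l_min, l_max):
    # Score every subsequence of every positive sample with lengths
    # l_min..l_max and keep the best-scoring one as the shapelet.
    samples = pos + neg
    labels = [1] * len(pos) + [0] * len(neg)
    best_q, best_score = None, -1.0
    for T in pos:  # candidates are taken from positive samples only
        for l in range(l_min, l_max + 1):
            for k in range(len(T) - l + 1):
                Q = T[k:k + l]
                dists = [sub_dis(Q, S) for S in samples]  # Definition 4
                s = score(dists, labels)                  # Eq. (7)
                if s > best_score:
                    best_q, best_score = Q, s
    return best_q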

Then, the score of each candidate subsequence is calculated using Eq. 7, and the subsequence with the highest score is identified as the shapelet. The score function is the core of the shapelet searching algorithm, and the calculation of the subDis distance is very time-consuming. Tricks like early abandoning, reuse of computation, parallel computing, and candidate filtering have been adopted to speed up the search [52,53,54,55]. Algorithm 2 illustrates a sliding distance computation in python–numpy style pseudo-code. Using matrix operations, all distances between a subsequence S and every sliding subsequence of a sequence T can be calculated simultaneously. With the sliding distances, the subDis distance can be obtained as:

$$\begin{aligned} subDis(S,T) = \min (SldDists(S,T)) \end{aligned}$$
(11)

In our experiments, the SldDists algorithm improves the speed of shapelet searching by about 15–40 times compared with the tricks of early abandoning and entropy pruning.

[Algorithm 2: SldDists, sliding distance computation; rendered as an image in the original]
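Algorithm 2 is likewise an image in the original; a numpy sketch consistent with its description (a pointwise distance matrix followed by diagonal sums) is:

import numpy as np

def sld_dists(S, T):
    # All squared Euclidean distances between S and every |S|-length
    # sliding subsequence of T. S: (l, c) and T: (m, c) feature sequences.
    l, m = len(S), len(T)
    d = ((S[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)  # pointwise, (l, m)
    # Summing the k-th diagonal of d gives Dis(S, T[k:k+l]).
    return np.array([np.trace(d, offset=k) for k in range(m - l + 1)])

The subDis of Eq. 11 then follows as sld_dists(S, T).min().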

5.1.4 Time complexity

For a given word, suppose there are 2n samples in the training set, half of them positive. If the average length of these samples is \({\bar{m}}\), then there are about \({\bar{m}}n\) candidate subsequences. Each candidate subsequence requires 2n subDis calculations, and the time complexity of the subDis function is \(O({\bar{m}}^2)\). The final time complexity of the shapelet searching algorithm is therefore \(O({\bar{m}}^3n^2)\).

5.2 Distance calculation acceleration

The SldDists function greatly speeds up the calculation of subDis. However, it is still too slow for constructing a large-scale sign language database. Inspired by the matrix profile works [62, 63], we observe that the subDis computations of candidate subsequences with adjacent locations and lengths involve a significant amount of repeated calculation. In this part, we greatly reduce this repetition in shapelet searching by drawing on the calculation tricks in [63].

Given two sequences A and B of lengths \(m_1\) and \(m_2\), the goal is to compute the distances between all subsequence pairs of A and B, denoted as distance matrices:

$$\begin{aligned} \varvec{M} = \{M_{i,j}^l \vert \quad&0\le i \le m_1-l, \quad 0\le j\le m_2-l,\nonumber \\&1\le l\le \min (m_1, m_2) \} \end{aligned}$$
(12)

where \(M_{i,j}^l\) is the Euclidean square distance between subsequences \(A_{i:i+l}\) and \(B_{j:j+l}\).

$$\begin{aligned} M_{i,j}^l = \sum _{k=0}^{l-1} (A_{i+k}-B_{j+k})^2= \sum _{k=0}^{l-1} d_{i+k,j+k} \end{aligned}$$
(13)

5.2.1 Adjacent location subsequences

For the distance of adjacent-location subsequences, \(M_{i+1, j+1}^l\), we have the following decomposition:

$$\begin{aligned} M_{i+1,j+1}^l&= \sum _{k=0}^{l-1} d_{i+1+k,j+1+k} = \sum _{k=1}^l d_{i+k,j+k}\nonumber \\&= M_{i,j}^l + d_{i+l,j+l} - d_{i,j} \end{aligned}$$
(14)

which means that we can calculate the distance between two subsequences from the distance of their preceding subsequences, giving the iterative update formula:

$$\begin{aligned} \varvec{M}^l[i+1,1:] = \varvec{M}^l[i,:-1] + \varvec{d}[i+l,l:] - \varvec{d}[i,:-l] \end{aligned}$$
(15)

where \(\varvec{d}\) is the pointwise squared Euclidean distance matrix between each point in sequence A and each point in B. With the iterative formula 15, we only need to calculate \(\varvec{M}^l[0,:]\), \(\varvec{M}^l[:,0]\), and \(\varvec{d}\) in advance; the rest of the distance matrix \(\varvec{M}^l\) can then be computed with just matrix additions. In fact, \(\varvec{M}^l[0,:]\) and \(\varvec{M}^l[:,0]\) are sliding distances as described in Algorithm 2, and \(\varvec{d}\) can also be seen as a special case of the sliding distance (with subsequence length 1). So they can all be accelerated using matrix operations as follows:

$$\begin{aligned} \varvec{d}[i,:]&= \sum _{c}(A[i] - B)^2\end{aligned}$$
(16)
$$\begin{aligned} \varvec{M}^l[0,:]&= \sum _{k=0}^{l-1}\varvec{d}[k,k:k+m_2-l+1]\end{aligned}$$
(17)
$$\begin{aligned} \varvec{M}^l[:,0]&= \sum _{k=0}^{l-1}\varvec{d}[k:k+m_1-l+1,k] \end{aligned}$$
(18)

where c in Eq. 16 indexes the feature dimensions, and Eq. 16 does the same work as line 5 in Algorithm 2.

5.2.2 Adjacent length subsequences

Next, let us look at the distance between subsequences of adjacent lengths. When the subsequence length is increased to \(l+1\), the distance between the two subsequences can be written as:

$$\begin{aligned} M_{i,j}^{l+1} = \sum _{k=0}^{l}d_{i+k,j+k}=M_{i,j}^l + d_{i+l,j+l} \end{aligned}$$
(19)

In this way, we obtain the iterative formula of the distance matrix \(\varvec{M}\) with the adjacent subsequence length:

$$\begin{aligned} \varvec{M}^{l+1} = \varvec{M}^l[:-1,:-1] + \varvec{d}[l:,l:] \end{aligned}$$
(20)

Combining the distance matrix update formulas 15 and 20, we obtain the algorithm for calculating the distances between all possible subsequences of two sequences, as shown in Algorithm 3. For a distance matrix \(\varvec{M}^l\), each row represents the sliding distances between a subsequence of A and the sequence B, so the minimum value of that row is the subDis between the subsequence and B.

$$\begin{aligned}&\varvec{M}^l[i] = SldDists(A[i:i+l], B)\end{aligned}$$
(21)
$$\begin{aligned}&\min (\varvec{M}^l[i]) = subDis(A[i:i+l], B). \end{aligned}$$
(22)
[Algorithm 3: DistanceMatrices; rendered as an image in the original]
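As Algorithm 3 is an image in the original, a numpy sketch of it under Eqs. (15)–(20) might be:

import numpy as np

def distance_matrices(A, B, l_min, l_max):
    # Distance matrices M^l for all lengths l in [l_min, l_max], built from
    # the pointwise distance matrix d with the updates of Eqs. (15)-(20).
    # A: (m1, c), B: (m2, c); returns a dict {l: M^l}. Note that the row
    # updates of Eq. (15) accumulate floating-point error, as discussed below.
    m1, m2 = len(A), len(B)
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)     # Eq. (16)
    l = l_min
    M = np.empty((m1 - l + 1, m2 - l + 1))
    M[0, :] = sum(d[k, k:k + m2 - l + 1] for k in range(l))     # Eq. (17)
    M[:, 0] = sum(d[k:k + m1 - l + 1, k] for k in range(l))     # Eq. (18)
    for i in range(m1 - l):                                     # Eq. (15)
        M[i + 1, 1:] = M[i, :-1] + d[i + l, l:] - d[i, :-l]
    mats = {l: M}
    for l in range(l_min, l_max):                               # Eq. (20)
        M = M[:-1, :-1] + d[l:, l:]
        mats[l + 1] = M
    return mats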

5.2.3 Improved shapelet searching

We can efficiently obtain the subDis distances between all subsequences of two sequences by iteratively updating the adjacent distances. In shapelet searching, to calculate the score of each candidate subsequence, the distance matrices between each positive sample and all samples should be computed in advance.

$$\begin{aligned} DistanceMatrices(T_a, T_b),\quad \forall \ T_a\in {\mathbb {T}}_p, T_b\in {\mathbb {T}} \end{aligned}$$

In fact, rather than iterating over all sample pairs as described above, a more efficient method is to concatenate all positive samples into a single sequence \(T_A\) and all training samples (both positive and negative) into a single sequence \(T_B\), and then input \(T_A\) and \(T_B\) directly into the DistanceMatrices algorithm. The distance matrix of a pair \(<T_a,T_b>\) is then a sub-matrix of the distance matrix of \(<T_A, T_B>\) and can be obtained directly by slicing.

However, there are two practical limitations to this concatenation method. First, as shown in Algorithm 3, at least two matrices \(\varvec{d}\) and \(\varvec{M}\) need to be constructed. Suppose we have 200 training samples (100 positive) with an average length of 400, stored as float32; then more than 23 GB of memory is needed, and the size grows quadratically with the number of samples. Second, the matrix additions of the iteration formula 15 accumulate error at each step, which can become significant when the sequence \(T_A\) is very long.

Finally, as shown in Algorithm 4, we obtain an improved shapelet searching algorithm based on distance matrices. In the algorithm, we first compute the distance matrices \({\mathcal {M}}=\{\varvec{M}^l\vert l\in [l_\mathrm{min},l_\mathrm{max}]\}\) between each positive sample \(T_a\) and the concatenated training sequence \(T_B\). The \(r\)-th row of a distance matrix \(\varvec{M}^l\) then holds the sliding distances between a candidate subsequence \(T_{a}[r:r+l-1]\) and \(T_B\). By slicing \(\varvec{M}^l[r]\) according to where a training sample \(T_b\) is located in \(T_B\), the sliding distances between the subsequence and \(T_b\) are obtained, and their minimum is the subDis. Once the subDis distances of a candidate to all training samples are available, Eq. 7 gives its score. Finally, the subsequence with the largest score is returned as the shapelet. If two subsequences have the same score, the one with the smaller standard deviation (std) of subDis distances is chosen.

[Algorithm 4: improved shapelet searching based on distance matrices; rendered as an image in the original]
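A compact sketch of Algorithm 4 follows, reusing the distance_matrices and score helpers above; the tie-breaking by the std of subDis distances is omitted, and every training sample is assumed to be at least l_max frames long:

import numpy as np

def improved_shapelet_search(pos, samples, labels, l_min, l_max):
    # Distance matrices of each positive sample against the concatenated
    # training sequence T_B; slicing row r per sample gives its sliding
    # distances, whose minimum is the subDis value of Eq. (22).
    T_B = np.concatenate(samples)
    bounds = np.cumsum([0] + [len(S) for S in samples])
    best_q, best_score = None, -1.0
    for T_a in pos:
        mats = distance_matrices(T_a, T_B, l_min, l_max)
        for l, M in mats.items():
            for r in range(M.shape[0]):  # candidate subsequence T_a[r:r+l]
                dists = [M[r, bounds[i]:bounds[i + 1] - l + 1].min()
                         for i in range(len(samples))]
                s = score(dists, labels)
                if s > best_score:
                    best_q, best_score = T_a[r:r + l], s
    return best_q

The per-sample slices skip the columns whose windows would straddle two concatenated samples, so no invalid cross-boundary subsequences are scored.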

5.3 Shapelet net

In contrast to the brute-force searching strategy above, another option for shapelet discovery is to remove the restriction that the shapelet must be one of the candidate subsequences and instead treat the shapelet as a sequence of unknown parameters to be learned. Paper [57] is one of the original papers proposing this idea, and this subsection introduces it in detail. When treating a shapelet Q as an unknown sequence, we first calculate the subDis distances between Q and all training samples and then determine the linear separation with maximum information gain using those distances.

For simplicity, the subDis distance between Q and a training sample T is denoted as X, and its expression (Eq. 2) is expanded as follows.

$$\begin{aligned}&\qquad X = \min _{j}Dis(Q,T_{j:j+l})\end{aligned}$$
(23)
$$\begin{aligned}&Dis(Q,T_{j:j+l})=\sum _{k=0}^{l-1}(Q_k-T_{j+k})^2\end{aligned}$$
(24)
$$\begin{aligned}&\quad =\sum _{k=0}^{l-1}Q_k^2+\sum _{k=0}^{l-1}T_{j+k}^2-2\sum _{k=0}^{l-1}Q_kT_{j+k}\end{aligned}$$
(25)
$$\begin{aligned}&\quad =Q\cdot Q + I\cdot T_{j:j+l}^2 - 2Q\cdot T_{j:j+l} \end{aligned}$$
(26)

where the dot operator indicates the inner product, and the square \((\cdot )^2\) is a point-wise operation.

As j varies from 0 to \(\vert T\vert -l\), the term \(Q\cdot Q\) is independent of j and remains constant, while the terms \(I\cdot T_{j:j+l}^2\) and \(Q\cdot T_{j:j+l}\) become the convolution operations \(I\otimes T^2\) and \(Q\otimes T\). Therefore, we have a new formula for the sliding distances:

$$\begin{aligned} SldDist(Q,T) = Q^TQ + I\otimes T^2 - 2Q\otimes T \end{aligned}$$
(27)

If the linear separation of the training set is represented by a linear binary classifier, the classification result for sample T can be written as:

$$\begin{aligned} {\hat{Y}} = \omega X + \omega _0 \end{aligned}$$
(28)

where \({\hat{Y}}\) is the predicted label (positive or negative), and \((\omega , \omega _0)\) are the weights. Then, the logistic loss can be used to evaluate the performance of the classifier:

$$\begin{aligned} {\mathcal {L}}({Y},{{\hat{Y}}}) = -{Y}\ln \delta ({{\hat{Y}}})-(1-{Y})\ln (1-\delta ({{\hat{Y}}})) \end{aligned}$$
(29)

where \(\delta ({\hat{Y}})=(1+e^{-{\hat{Y}}})^{-1}\).

The above process of subDis computation and sample classification can now be represented as a shapelet Net, shown in Fig. 3. The shapelet kernel and the binary classifier are the unknown parameters to be learned: when a sample T is entered, a predicted label is returned. Note that the term \(Q^TQ\) in Eq. 27 is ignored because it has no impact on how the samples are classified. Then, given a training set \({\mathbb {T}}\) and its labels \(\varvec{Y}=[Y_1, \dots , Y_{\vert {\mathbb {T}}\vert }]\), the optimal shapelet Q and linear hyper-plane \(\varvec{\omega }\) can be learned by minimizing a regularized objective function \({\mathcal {F}}\):

$$\begin{aligned} {\mathcal {F}}(Q,\varvec{\omega }) = \sum _{i=1}^{\vert {\mathbb {T}}\vert }{\mathcal {L}}({Y_i},{\hat{Y_i}}) + \lambda \Vert \varvec{\omega }\Vert ^2 \end{aligned}$$
(30)

Two issues arise during the training of the shapelet Net. First, the \(\min (\cdot )\) function in Eq. 23 is not differentiable, which prevents the use of gradient descent in training. This can be solved by replacing the minimum function with a differentiable softmin function:

$$\begin{aligned} softmin(\varvec{x}) = \frac{\sum _i x_ie^{\alpha x_i}}{\sum _j e^{\alpha x_j}} \end{aligned}$$
(31)

where \(\alpha <0\) is a parameter that controls the precision of the softmin; as \(\alpha \rightarrow -\infty\), the softmin approaches the true minimum function.

Second, the objective function in Eq. 30 is not convex, and gradient-based optimization usually converges to a local minimum. To obtain a reasonable suboptimal result, the learnable parameters must be initialized well. The shapelet parameters in [57, 59] were initialized with clustering centres of the candidate subsequences. However, since our task involves only one shapelet, initializing with cluster centres easily converges to a non-target sign. We therefore use the shapelet searched from a small subset of training samples by Algorithm 1 as the initial parameters.
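A minimal PyTorch sketch of the shapelet Net, as one possible rendering of Fig. 3 (with the constant \(Q^TQ\) term dropped as noted above and an assumed softmin precision \(\alpha =-30\)), might be:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeletNet(nn.Module):
    # Sliding distances via convolution (Eq. 27), softmin pooling (Eq. 31),
    # and a linear binary classifier (Eq. 28).
    def __init__(self, length, channels, alpha=-30.0):
        super().__init__()
        self.Q = nn.Parameter(torch.randn(1, channels, length))  # shapelet kernel
        self.alpha = alpha                                       # softmin precision
        self.w = nn.Parameter(torch.zeros(1))
        self.w0 = nn.Parameter(torch.zeros(1))

    def forward(self, T):  # T: (batch, channels, frames)
        ones = torch.ones_like(self.Q)
        sld = F.conv1d(T ** 2, ones) - 2 * F.conv1d(T, self.Q)  # I*T^2 - 2Q*T
        sld = sld.squeeze(1)
        x = (sld * torch.softmax(self.alpha * sld, dim=-1)).sum(-1)  # softmin
        return self.w * x + self.w0  # logits, cf. Eq. (28)

Training then minimizes Eq. 30, e.g. F.binary_cross_entropy_with_logits(model(T), Y) with float 0/1 labels Y, plus an \(L_2\) penalty on the classifier weights.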

Compared with shapelet searching, the shapelet Net no longer discovers shapelets by brute-force searching but learns the shapelet as unknown parameters through network training. This brings two changes. The first is computational complexity: the shapelet Net has a complexity of \(O({\bar{m}}^2n\times Ite_\mathrm{max})\) [57] instead of the \(O({\bar{m}}^3n^2)\) of shapelet searching, and in general \(Ite_\mathrm{max}\ll {\bar{m}}n\). The second is that, rather than being an actual candidate subsequence, the learned shapelet is more akin to a generalization of the best candidates, which makes it theoretically more representative than a searched shapelet.

Fig. 3
figure 3

The structure of the shapelet Net

5.4 Sample augmentation

The shapelet acquired through shapelet searching or the shapelet Net can be taken as the target sign for the given word. With the shapelet, more instances of the target sign can be gathered by collecting, in each positive sample, the subsequence at the shortest distance from the shapelet. However, as introduced in Sect. 4.2, the subtitles are weak and noisy: there are uncertain differences in time displacement, occurrence, and order between subtitles and signing. For a given word, a positive sample is not guaranteed to contain a target sign, and a negative sample may also contain one. Thus, two tasks need to be done: removing fake target signs from positive samples and identifying latent target signs in negative samples.

For a shapelet Q, the target signs from all positive samples can be identified using the following formula:

$$\begin{aligned} S_i = \mathop {\arg \min }_S Dis(Q,S), \quad \forall S \in {\mathbb {S}}^{\vert Q \vert }_{T_{pos}^i} \end{aligned}$$
(32)

According to [45], only \(67\%\) of their positive samples contain true target signs, and in our data source, the true rate of positive samples is even below \(60\%\). We therefore set a true rate f, and only the proportion f of identified target signs most similar to the shapelet are considered true target signs. Without loss of generality, assume the identified target signs from all positive samples are sorted in increasing order of their distance from Q. Then, the chosen true target signs are \(S_1, S_2, \ldots , S_{n_f}\), satisfying:

$$\begin{aligned} \Vert Q-S_1\Vert \le \Vert Q-S_2\Vert&\le \dots \le \Vert Q-S_{n_f}\Vert \end{aligned}$$
(33)

where \(n_f = \lfloor f \times \vert {\mathbb {T}}_p\vert \rfloor\), and \(\lfloor \cdot \rfloor\) is the floor function. Then, the threshold \(\tau =\Vert Q-S_{n_f}\Vert\) can be used to judge whether a target sign is true or not:

$$\begin{aligned} S' = \left\{ \begin{aligned} \text {target sign}\qquad&\Vert Q - S'\Vert \le \tau \\ None \qquad&\Vert Q - S'\Vert > \tau \end{aligned} \right. \end{aligned}$$
(34)

When the number of positive samples is small, however, \(\tau\) has a limited number of candidate values and is not sensitive to variations in f. An additional factor \(\theta\) is therefore introduced to allow finer and smoother adjustment of \(\tau\). First, denote \(d_\mathrm{min} = \Vert Q-S_1\Vert\) and \(d_\mathrm{max} = \Vert Q-S_{n_f}\Vert\); if \(\Vert Q-S_1\Vert = 0\), \(d_\mathrm{min}\) is replaced with the smallest nonzero distance. Then, the distance threshold \(\tau\) is redefined as:

$$\begin{aligned} \tau = d_\mathrm{min} + \theta (d_\mathrm{max} - d_\mathrm{min}) \quad 0\le \theta \le 1 \end{aligned}$$
(35)
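A small sketch of this threshold computation, with f and \(\theta\) defaulting to the 0.3 values used later in Sect. 6.4, is:

def distance_threshold(dists, f=0.3, theta=0.3):
    # Sketch of Eqs. (33)-(35): dists are the distances between the
    # shapelet and the target signs identified in the positive samples.
    d = sorted(dists)
    n_f = max(1, int(f * len(d)))                 # floor, cf. Eq. (33)
    d_max = d[n_f - 1]
    d_min = next((x for x in d if x > 0), d_max)  # smallest nonzero distance
    return d_min + theta * (d_max - d_min)        # Eq. (35)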

To obtain more instances, we propose a sample augmentation strategy that collects, from all videos, every subsequence satisfying Eq. 34 as a target sign. Since we have at least 1000 words and over 96 h of sign language video, efficient sliding distance calculation is one of the cores of sample augmentation. Two sliding distance calculation methods have been presented in this work: Algorithm 2 and formula 27, the latter of which transforms the sliding distance into a combination of convolution operations.

The convolution theorem states that convolution in the time domain equals point-wise multiplication in the frequency domain. Because point-wise multiplication is much faster than direct convolution, we adopt the SlidingDotProduct algorithm proposed in [62] to calculate the convolution operations. The details are shown in Algorithm 5, which transforms between the time and frequency domains with the forward and inverse fast Fourier transforms (FFT and IFFT). Before the FFT, the input sequences must be padded and reversed so that the subsequent convolution is produced in the right order. The time complexity of the FFT and IFFT is \(O(n\log n)\), and they can be computed at very high speed with many available mathematical libraries. According to our tests, the sliding distance calculation based on Eq. 27 plus Algorithm 5 is 4–5 times faster than Algorithm 2, whether running on CPUs or GPUs.

[Algorithm 5: SlidingDotProduct; rendered as an image in the original]
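Algorithm 5 is an image in the original; a one-dimensional numpy sketch of the FFT-based sliding dot product it describes, following [62] and applied per feature channel in practice, is:

import numpy as np

def sliding_dot_product(Q, T):
    # Sliding dot products QT[i] = sum_k Q[k] * T[i+k] via the FFT.
    l, m = len(Q), len(T)
    Ta = np.concatenate([T, np.zeros(m)])                # pad T to length 2m
    Qr = np.concatenate([Q[::-1], np.zeros(2 * m - l)])  # reverse and pad Q
    prod = np.fft.irfft(np.fft.rfft(Ta) * np.fft.rfft(Qr), n=2 * m)
    return prod[l - 1:m]                                 # the valid positions

The sliding distances of Eq. 27 then follow as \(Q\cdot Q\) plus the sliding window sums of \(T^2\) minus twice these dot products.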

Algorithm 6 shows the details of our final sample augmentation strategy. For a given word, all subsequences in all video sequences whose distance from the shapelet is less than \(\tau\) are collected as instances of the target sign. The distance threshold \(\tau\) is determined by Eq. 35, and the sliding distances between the shapelet and a long video sequence are calculated by combining Eq. 27 and Algorithm 5. It is important to note, however, that the subsequences surrounding a target sign subsequence may also be very similar to the shapelet. Therefore, in lines 10–17 of the algorithm, we ensure that each collected instance has a lower distance from the shapelet than its neighbours.

[Algorithm 6: sample augmentation; rendered as an image in the original]
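As Algorithm 6 is an image in the original, the following is a minimal sketch of the augmentation loop; its neighbour check uses only the immediately adjacent windows, a simplification of lines 10–17:

import numpy as np

def augment_samples(Q, videos, tau):
    # Collect every subsequence whose distance to the shapelet Q is below
    # tau and is a local minimum of the sliding-distance curve, so that
    # overlapping near-duplicate windows of the same sign are skipped.
    l, instances = len(Q), []
    for T in videos:
        dists = sld_dists(Q, T)  # or via Eq. (27) + sliding_dot_product
        for k, dk in enumerate(dists):
            left = dists[k - 1] if k > 0 else np.inf
            right = dists[k + 1] if k + 1 < len(dists) else np.inf
            if dk <= tau and dk <= left and dk <= right:
                instances.append(T[k:k + l])
    return instances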

6 Experiments

Two main experiments are presented in this section. First, we show how the proposed framework automatically constructs a large-scale word-level sign language database from sign language-interpreted videos; the experimental environments and parameter settings are described in detail, and the results of the two shapelet methods (shapelet searching and shapelet Net) are compared and discussed. Second, three state-of-the-art SLR methods are used to evaluate the resulting database in order to demonstrate its practicability. Additionally, the sample augmentation strategy and the effects of the different shapelet methods are examined.

6.1 Original data source

To avoid the impact of language differences across regions and domains, we collected sign language videos with similar occasions and themes. Eighty-nine live sign language interpretation videos of the Scottish Parliament were downloaded from YouTube [64]. These real-time sign language interpretations are supported by The Scottish Parliament's BSL Plan 2018–2024.Footnote 2 The 89 downloaded videos are interpreted by 8 native British Sign Language (BSL) signers and have a total duration of about 96 h. The original videos had no subtitles, so we generated individual subtitle files for them using the English speech recognition tool provided by YouTube. All of these subtitle files together contain about 32,000 meaningful and meaningless words.

After deleting the meaningless words, stop words, polysemous words, and low-frequency words, the remaining words were processed with stemming and lemmatization. Finally, we chose 1000 words with good properties to form the final corpus of our target database. Figure 4 shows the frequency distribution of these words: they appear roughly 10 to 800 times in the subtitles (the minimum frequency is set to 10). However, the distribution is significantly imbalanced: 80% of the words have fewer than 213 occurrences and 95% have fewer than 553. The average frequency of all words is 210, while the average frequency of the bottom 80% of words is only 67.

Fig. 4
figure 4

The frequency distribution of the final corpus

6.2 Motion feature extraction

The skeleton keypoints of the signers in all sign language-interpreted videos were extracted using the openpose library. The provided models (body and hand), trained on the CMU Panoptic Dataset,Footnote 3 can be applied directly to new videos without parameter retuning. In our work, the models were loaded with pytorch and evaluated on GPUs. The body keypoints were first extracted from the cropped regions containing the signers; the hand keypoints were then extracted from regions determined by the extracted body points (i.e. shoulder, elbow, wrist) [36]. Extracting the skeleton keypoints of all 96 h of sign language video took about a week on a single GeForce-RTX-3090 GPU.

In our work, only the keypoints of the upper body and hands were selected to construct the motion features using Eq. 4. Before feature construction, the coordinates of the chosen keypoints were filtered with a median filter to remove outliers; invalid coordinates (i.e. the positions of undetected keypoints) remain unchanged as (0, 0). In addition, our work took into account the effects of left- and right-handedness: for each motion feature subsequence, an additional x-mirrored version (all x coordinates multiplied by \(-1\)) is provided.

6.3 Shapelets learning

As described in Sect. 4.3, positive and negative samples were generated for each word in the final corpus. The candidate windows were determined using Eq. 5, with the parameter \(\beta\) set to 1.5 by trial. In this setting, most of the ground truths of the target signs are contained in the positive sequences.

6.3.1 Ground truths

To evaluate the shapelet methods, ground truths of the target signs must be provided. In this paper, we selected 99 words and annotated 10 to 30 positive samples for each. To make the annotation more reliable, most of the selected words have a clear meaning, with corresponding examples available on the Internet. If a positive sample contains the target sign of the word, it is called a true positive sample, and the indices of the begin and end frames of the sign were recorded; otherwise, it is called a false positive sample. From the annotations of these 99 words, we deduce that the average true rate of the positive samples is about \(60\%\), meaning that about \(40\%\) of the positive samples do not contain a target sign. The time span of a sign is typically 10 to 40 frames.

6.3.2 Speed test

With the training samples of a given word, the shapelet can be discovered with the two shapelet methods (searching and Net). To show the effectiveness of the distance calculation acceleration discussed in Sect. 5.2 and to compare the speed of the available shapelet methods, Fig. 5 illustrates the time consumption of different shapelet discovering methods for different sample numbers.

Five shapelet methods are compared in Fig. 5: three shapelet searching methods and two shapelet Net methods. search-ori is the baseline shapelet searching algorithm proposed in [51], which uses a two-level pruning strategy to abandon hopeless candidates early. search-idp is the algorithm proposed in [55], which uses sampling and filtering strategies to greatly reduce the number of candidate shapelets; however, since the labels of our time series are weak and noisy and the lengths and contents vary considerably, only the filtering strategy (not the sampling strategy) is adopted here. In addition, our Algorithm 2 is also used to speed up the calculation of search-idp. search-dm is the improved shapelet searching Algorithm 4, which significantly reduces the repeated subDis computations through the proposed distance matrices. net-mean and net-dm are the proposed shapelet learning algorithm with two different parameter initializations: net-mean uses the average values of all candidate shapelets [57, 59], while net-dm uses the shapelet obtained by search-dm as the initial shapelet. In the test, equal numbers of positive and negative samples were chosen randomly. We fixed five candidate lengths for the shapelet: 15, 20, 25, 30, and 35; the length range of search-idp is likewise set to [15, 35]. The reported time of the net-dm method includes the time consumed by shapelet initialization with \(10\%\) of the samples.

From the experimental results in Fig. 5, we find that, compared to the baseline method search-ori, the other four shapelet discovering methods greatly reduce the calculation time, and net-dm has the highest learning speed. Among them, the time consumption of search-ori, search-idp, and search-dm grows roughly quadratically, consistent with brute-force searching. The two net methods have similar learning speeds, with time consumption growing roughly linearly. The fact that net-dm is faster than net-mean is probably because the convolutional network converges faster with a more reasonable initialization. The reason search-dm is not as efficient as the net methods is that, although it computes sliding distances very efficiently (the cost of a new distance matrix based on Eq. 20 is almost negligible), the costs of both the distance matrix initialization and the score calculation still grow quadratically with the number of training samples.

Fig. 5
figure 5

The time-consuming comparison

6.3.3 Recall rate test

This experiment evaluates the performance of the four shapelet methods (search-dm, search-idp, net-mean, net-dm) on the annotated set of 99 words. The maximum numbers of training samples for the search- and net- methods are set to 100 and 600, respectively, so that large-scale shapelet learning completes in a reasonable time. Moreover, according to Eq. 10, the final shapelet length range is set to \((10,\ 40)\). The commonly used recall rate metric [38, 45, 46] is adopted to measure the correctness of the discovered shapelets. An identified sign is considered correct if its overlap rate with the ground truth exceeds \(50\%\); similarly, a word is considered correctly recalled if more than \(50\%\) of its target signs are correctly identified. The recall rate is the proportion of correctly recalled words.

Tables 1 and 2 compare the overlap rate, identification rate, and recall rate of the four shapelet methods under different conditions, with the best results for each rate under each condition shown in bold. First, the true ratio and the number of positive samples have a significant impact on the learning results: a higher true ratio means higher overlap, identification, and recall rates. This is predictable, because the smaller the true ratio, the further shapelet discovery deviates from the assumptions of the MIL problem. Likewise, the larger the number of training samples, the higher the overlap, identification, and recall rates, which indicates that more training samples effectively enhance the ability of the shapelet discovering methods to learn true target signs.

Furthermore, the net-mean method produces the worst learning results, demonstrating two facts: first, a good initialization of the shapelet Net is critical, and second, the mean-value initialization strategy works poorly in our task. Overall, the search-idp method is worse than search-dm because, compared to the full search of search-dm, its filtering can discard good candidate shapelets. However, search-idp performs best when \(\text {true ratio} < 0.5\), probably because the filtering strategy also removes many interfering candidates under noisier conditions. The net-dm method performs best under nearly all conditions, probably because it is initialized with searched shapelets, making it effectively a two-stage learning model: the first stage is a rough search using search-dm on a small subset of samples, while the second stage removes the restrictions on the shapelet values and fine-tunes the rough shapelet.

Based on the results of the above recall rate test, only the search-dm and net-dm methods are used in the following database construction, and the terms search and net in the subsequent sections refer to search-dm and net-dm, respectively.

Table 1 The average overlap rate and identification rate
Table 2 The recall rate

6.4 Database construction

6.4.1 Signs collection

In this section, we applied the two best shapelet methods (i.e. search-dm and net-dm) to the whole corpus with the same parameter settings as in the above subsection. The discovered shapelets are then used to collect instances of the target signs.

Figure 6 shows the signs collection process for the word “balance”. Given a discovered shapelet, the target signs are first identified from the positive samples. We regard the target signs closer to the shapelet (shown with a green background) as having higher identification confidence, and they serve as the criterion for judging whether a subsequence is similar to the shapelet. Finally, the additional similar subsequences are collected as augmented signs. Figure 6a, b show the signs collection process based on shapelet searching and the shapelet net, respectively. The only difference between them is the shapelet itself: as shown in the figure, the shapelet discovered by shapelet searching is an actually existing motion clip, whereas the shapelet learned by the shapelet net is a parameterized pose sequence without a corresponding actual motion clip.

Fig. 6 The signs collection illustration of the word “balance”

6.4.2 Four subsets

Finally, we constructed a Scottish Parliament British Sign Language database, SPBSL, which comprises four large-scale sign language subsets based on different shapelet methods and sampling strategies: no_aug-search, no_aug-net, aug-search, and aug-net, where search and net correspond to the two shapelet methods, and no_aug and aug indicate whether the samples contain augmented signs. The databases, models, and code are available at our project page (Footnote 4).

The construction strategy of the no_aug sign language databases is to collect only the basic target signs (localized from the positive samples) as samples. This strategy has been adopted in many studies [38, 45, 46]. From our analysis and experiments on the annotated words, we know that the true ratio of positive samples is about 0.6, while the proportion of true positive samples that are correctly localized (i.e. the identification rate) is only about 0.5. Therefore, when building the non-augmented sign language database, we keep only the fraction f of nearest target signs as samples according to Eq. 33. After test tuning, setting f to 0.7 gives a better balance between the number and the confidence of samples in the non-augmented database.
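As a rough illustration of this selection step, the sketch below keeps the fraction f of localized target signs nearest to the shapelet. Eq. 33 is not reproduced here, and the distance values and all names are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of keeping the nearest fraction f of target signs
# (cf. Eq. 33). distances[i] is the distance between the i-th localized
# target sign and the shapelet.

def keep_nearest_fraction(target_signs, distances, f=0.7):
    """Return the f-fraction of target signs closest to the shapelet."""
    order = np.argsort(distances)                 # nearest first
    n_keep = int(np.ceil(f * len(target_signs)))  # how many to keep
    return [target_signs[i] for i in order[:n_keep]]
```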

The augmented sign language databases are constructed by collecting similar instances of the shapelet (the signs with green backgrounds in Fig. 6) from the whole data source as samples. According to Algorithm 6, an instance is accepted as a sample if its distance from the shapelet is less than the threshold \(\tau\) and it has the shortest distance among its neighbours. The threshold \(\tau\) is determined by the two-level conditioning function of Eq. 35, with the parameters f and \(\theta\) both set to 0.3 in our construction.
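A minimal sketch of this acceptance rule follows. The sliding-distance computation, the neighbourhood width, and the derivation of \(\tau\) via Eq. 35 are abstracted away, so all names here are illustrative assumptions rather than the paper's Algorithm 6:

```python
import numpy as np

# Illustrative sketch: a window is collected as an augmented sign if it
# is below the threshold tau AND is the nearest (local minimum of the
# sliding distance) among its neighbours.

def collect_augmented(dist_profile, tau, half_window):
    """dist_profile[i]: distance between the shapelet and the subsequence
    starting at frame i. Returns start indices of accepted samples."""
    accepted = []
    for i, d in enumerate(dist_profile):
        lo = max(0, i - half_window)
        hi = min(len(dist_profile), i + half_window + 1)
        if d < tau and d == np.min(dist_profile[lo:hi]):
            accepted.append(i)
    return accepted
```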

Figure 7 illustrates the sample distributions of the four sign language databases. Since the number of samples in the non-augmented databases is completely determined by the number of positive samples and the parameter f, the sample distributions of the two non-augmented databases (no_aug-search and no_aug-net) are identical and are shown jointly in Fig. 7a. The sample distributions of the two augmented databases are shown in Fig. 7b, c. As the figure shows, the sample augmentation strategy significantly increases the number of samples: compared with the original database, the aug-search and aug-net databases increase the mean number of samples by factors of 3.2 and 2.8, and the mean number of samples of the \(80\%\) smallest classes by factors of 5.3 and 5, respectively. In summary, sample augmentation both increases the number of samples and improves the balance of the database (the distribution is flatter).

Fig. 7 The occurrence distributions of the four constructed databases

6.4.3 Test sets

To evaluate the four constructed sign language databases and compare the impact of the different shapelet methods and sampling strategies, we describe here the construction of the corresponding test sets. A test set is designed as the collection of target signs that satisfy the condition of Eq. 35, where the factors f and \(\theta\) are again both set to 0.3. First, the small value of f gives the test set high identification confidence. Second, as a common subset of the augmented and non-augmented datasets, the test set can be used to evaluate the sample augmentation strategy. In our experiments, two test sets were constructed, one for each shapelet method.

6.5 Database evaluation

6.5.1 Compared methods

In this work, we chose three state-of-the-art SLR methods to evaluate the four subsets of SPBSL: I3D [11], GRU [9], and TGCN [10]. Their structures are shown in Fig. 8.

Fig. 8 The structures of the three state-of-the-art sign language classifiers

I3D is a spatio-temporal CNN architecture that takes a multi-frame video as input and outputs class probabilities over sign categories. We adopt the I3D architecture because of its success on action recognition benchmarks. The original I3D network was trained on ImageNet [65] and fine-tuned on Kinetics-400 [11]. To apply I3D to our SLR task, a common approach is to treat the originally trained model as a pre-trained model, replace the final linear classification layer to match the number of classes, and then fine-tune the model on the training data.
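The head-replacement step might look like the following sketch. We use torchvision's r3d_18 as a stand-in 3D CNN, since loading the paper's actual I3D checkpoint depends on the specific codebase; the class count is an assumption matched to the SPBSL subset being trained:

```python
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Hedged sketch of fine-tuning a pre-trained spatio-temporal CNN:
# load Kinetics-pretrained weights, swap the classification head,
# then fine-tune on the sign language training data.
num_classes = 1000  # assumed: match the SPBSL subset size
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new classifier
# ...then fine-tune the model (all layers, or only the last few).
```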

GRU is a classical architecture for motion recognition with good modelling ability for sequence data. In this paper, the designed motion features were concatenated and then fed to a 2-layer stacked GRU. In practice, the fitting ability of the model can be adjusted by changing the hidden layer size.
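A minimal sketch of such a classifier is given below. The feature dimension and class names are assumptions; the hidden size follows the setting stated later in the implementation details (64 for fewer than 500 classes, 128 otherwise):

```python
import torch.nn as nn

# Sketch of a 2-layer stacked GRU classifier over concatenated motion
# features; not the paper's exact implementation.
class GRUClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes, hidden=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):            # x: (batch, T, feat_dim)
        _, h = self.gru(x)           # h: (num_layers, batch, hidden)
        return self.fc(h[-1])        # classify from the last hidden state
```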

TGCN is an architecture proposed in [10] that stacks multiple residual graph convolution blocks and takes the average pooling result along the temporal dimension as the feature representation of the pose trajectories. In the first layer, the network takes as input a \(K \times 2T\) matrix of body keypoint coordinates, where K is the number of keypoints and T is the number of frames per sample. As above, the keypoint coordinates are first processed by Eq. 4. Finally, the average pooling layer followed by a softmax layer is employed for classification.

6.5.2 Implementation details

The above three recognition models are implemented in PyTorch. Our experimental settings basically follow the training configurations of [10]. For the input video frames of I3D, the original frames are first resized to \(256\times 256\); we then randomly crop a \(224\times 224\) patch from each input frame and apply horizontal flipping with a probability of 0.5. The hidden layer size of the GRU is empirically set to 64 when the number of classes is less than 500 and to 128 when it is greater than or equal to 500. The same setting is used for the hidden layer size of TGCN, and the number of stacked residual graph convolution blocks in TGCN is set to 24.
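The spatial augmentation for I3D inputs could be expressed with standard torchvision transforms as below. Note that in practice the same crop and flip must be applied consistently to all frames of a clip; this per-frame pipeline is a simplification:

```python
from torchvision import transforms

# Sketch of the I3D input augmentation described above.
i3d_transform = transforms.Compose([
    transforms.Resize((256, 256)),         # resize original frames
    transforms.RandomCrop(224),            # random 224x224 patch
    transforms.RandomHorizontalFlip(0.5),  # flip with probability 0.5
])
```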

According to the average length of our manually annotated sign samples, the temporal length of the input to all three models is set to 20 frames. When an original sample has more than 20 frames, the input frames are chosen by uniform sampling; when it has fewer than 20 frames, the input is randomly padded at the front or back with boundary frames. Finally, the Adam optimizer is used to minimize the cross-entropy loss for all three models.
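The fixed-length temporal normalization could be sketched as follows; the exact sampling and padding rules of the paper's code are assumptions here:

```python
import random

# Sketch of fixed-length (T=20) temporal normalization: uniform
# sampling for long clips, random boundary-frame padding for short ones.
def normalize_length(frames, T=20):
    n = len(frames)
    if n >= T:
        # uniformly sample T frame indices over the original clip
        idx = [round(i * (n - 1) / (T - 1)) for i in range(T)]
        return [frames[i] for i in idx]
    out = list(frames)
    while len(out) < T:
        if random.random() < 0.5:
            out.insert(0, out[0])   # pad at the front with the first frame
        else:
            out.append(out[-1])     # pad at the back with the last frame
    return out
```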

Before the database evaluation, subsets of different sizes (100, 300, 500, and 1000 classes) were first constructed from the four sign language datasets, and each subset was then evaluated with the three recognition models (I3D, GRU, and TGCN). To keep the number of samples balanced among classes, the maximum number of samples per class is set to 200. The ratio of training to validation samples is set to 3:1, and training is terminated when the performance on the validation samples no longer improves. The two test sets designed in Sect. 6.4.3 were used to evaluate the trained models. We use top-k classification accuracy with \(k=\{1,5,10\}\) to measure model performance on the databases. In SLR, many misclassifications can be corrected with contextual knowledge, so top-k accuracy is a more reasonable metric for word-level SLR.
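For reference, top-k accuracy can be computed as in the following sketch (names are illustrative):

```python
import torch

# Minimal sketch of top-k accuracy for k = 1, 5, 10.
# logits: (N, num_classes) model outputs; labels: (N,) ground-truth ids.
def topk_accuracy(logits, labels, ks=(1, 5, 10)):
    result = {}
    for k in ks:
        topk = logits.topk(k, dim=1).indices            # (N, k) predictions
        hit = (topk == labels.unsqueeze(1)).any(dim=1)  # label in top-k?
        result[k] = hit.float().mean().item()
    return result
```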

7 Discussion

7.1 Impacts of video parameters

This subsection qualitatively discusses the impact of video parameters on database construction. In general, the higher the quality of the video resources, the higher the quality of the constructed database. High-quality video requires a high frame rate, high spatial resolution, high contrast resolution, etc. Fast and small movements can be captured well with a high frame rate, while more details (e.g. face, handshape) can be resolved with a high spatial resolution. Additionally, a high contrast resolution makes the ROI easier to identify. In our study, high-quality videos first assist the extraction of clear motion features; higher-confidence shapelets can then be discovered from the more accurate features, which in turn influences the database constructed from those shapelets. Furthermore, the database is essentially a collection of video clips, and good video parameters also make them easier for various classifiers to recognize. Finally, it is worth noting that the input images for pose extractors and sign recognizers are resized to a small fixed size (e.g. \(256 \times 256\)); it is therefore unnecessary to seek a very high spatial resolution, and the input images typically need to be cropped to ensure a high ROI ratio.

7.2 Performance evaluation of networks

The recognition accuracy of the three models on all database subsets is shown in Table 3. The highest recognition accuracy for each database size under each classifier is bolded to show the impact of database size. Overall, I3D and TGCN perform better than GRU, which is consistent with the experimental results of [10]. However, inconsistently with [10], the classification accuracy of I3D is lower than that of TGCN, and I3D even performs worse than GRU on the non-augmented databases with class numbers below 300. We suggest two reasons for this inconsistency. First, since I3D is larger and more complex than the other two models, it tends to overfit when the number of training samples is small. For the non-augmented databases, the shortage of samples makes I3D perform poorly and unstably (its classification accuracy on the 500-class non-augmented databases is higher than on the 300- and 100-class ones). The better and more stable performance of I3D on the augmented databases also confirms that the shortage of training samples in the non-augmented databases is the main reason for its poor performance. Second, since our databases are constructed from shapelets learned on extracted pose features, the image-based I3D does not perform as well as the pose-based TGCN.

7.3 Effects of shapelets and sample augmentation

As shown in Table 3, the performance of the databases built with the different shapelet methods is similar across the three models, and the evaluation results are almost identical for the augmented databases. Overall, the classification accuracy on the net-based databases is slightly higher than on the search-based databases. Compared with the slight impact of the shapelet method, the sample augmentation strategy improves the models' recognition accuracy significantly (about 20% on average); the top-1 accuracy of GRU, TGCN, and I3D on the augmented databases improves by 33%, 10%, and 28% on average, respectively.

Table 3 Top-1, top-5, top-10 accuracy (%) achieved by each model (by column) on subsets of SPBSL

8 Conclusion

We propose a novel shapelet-based framework that automatically builds a large-scale word-level sign language database from online sign language-interpreted videos. Two modified shapelet methods are shown to identify target signs from weak and noisy supervision information in a shorter time, and the shapelet net method, initialized with a roughly searched shapelet, shows superiority in both speed and performance. We then construct a 1000-word British Sign Language database, SPBSL, which contains four subsets based on different shapelet methods and sampling strategies. Finally, we evaluated three state-of-the-art network architectures as baselines on our database and demonstrated promising results for the proposed framework.

In future work, in response to the current low recall rate, a judging method should be proposed to remove invalid words and signs. In addition, a more robust and reasonable sign language feature descriptor needs to be designed to handle all kinds of variation, and an appropriate time-scale scaling needs to be introduced to make the distances between motions more accurate.