1 Introduction

Machine learning has progressively extended its application domain to the field of Machine Translation (MT), which has benefited from the emergence of effective statistical methods such as phrase-based (statistical) MT. Phrase-based MT is a powerful statistical method in which systems are trained automatically from very large collections of translated text (called parallel corpora). It relies on heuristics for unsupervised, high-quality word alignment based on the IBM models (IBM Models 1, 2, 3 and 4) and word frequency. The effectiveness of such a system depends on the quality of the alignment and on the size of the parallel training corpora. In other words, the performance of statistical MT is closely tied to the availability of very large parallel corpora, which is not the case for sign language data.

Although the quality and availability of sign language (SL) corpora have improved greatly in the past few years Neidle and Sclaroff (2002), Efthimiou (2007), the majority of existing sign language corpora focus on video annotation, such as the NCSLGR Corpus (National Center for Sign Language and Gesture Resources) and the BSL (British Sign Language) Project. The literature contains some attempts to create parallel textual SL corpora, such as the German RWTH-PHOENIX-Weather corpus, but in general there is a lack of large multilingual parallel corpora for sign languages. This represents a significant obstacle for sign language research Efthimiou et al. (2009), particularly for sign language MT.

This work is part of the WebSign project Boulares and Jemni (2012), which aims to translate text into sign language animation performed by a 3D virtual person. The WebSign project is composed mainly of a machine translation module and an animation module. In this paper, we describe the machine translation module, which translates a manually pre-treated transcription of English text into an American Sign Language (ASL) textual transcription that includes signing space information (presented in Sect. 3). The choice of ASL transcription is motivated by the availability of relevant works studying the structure of ASL sentences Neidle et al. (2000), Stokoe (1978), which is, in general, not the case for many other sign languages. In this context, we created an ASL corpus composed of 300 parallel phrases to train our MT approach, as well as 100 different parallel phrases to test the translation process.

In the MT field, phrase-based and regression-based MT are considered the most relevant techniques. Phrase-based MT is a powerful statistical method, but it requires very large training data to give good translation results. The translation process of regression-based MT relies on learning a linear regression between word-to-word feature mappings. For a new, unknown input phrase (feature vector), the learned linear regression model is used to predict the target feature vector. Once the target feature vector is obtained, a multi-graph search is used to find all possible target words whose mappings correspond to the translated feature vector; this step is called the decoding process. In order to justify our MT choice, we conducted experiments with both techniques and showed experimentally that our regression-based MT performs better than phrase-based MT in the context of small-scale corpora.

The main contribution of this work involves three aspects. The first is the use of the existing "elastic net" regression approach Hastie et al. (2006), which is known to improve on the L1-norm lasso and the L2-norm ridge in terms of fitting accuracy, measured by the R-squared score of the regression function Colin et al. (1997). The second is the application of Latent Semantic Analysis (LSA) in the decoding process, after learning the regression function with the elastic net method. The third is the application of this approach to a small-scale ASL parallel corpus with a simple ASL phrase structure (Subject Object Verb form).

In this context, we created a small parallel corpus composed of 300 manually pre-treated English sentences, in order to obtain a suitable representation of ASL. We used the n-spectrum weighted word kernel Leslie and Eskin (2002), Watkins (2000) to generate feature vector mappings of both source and target 2-grams. To learn the function that maps source 2-grams to target 2-grams, we compared the L1-norm (lasso) and the L2-norm (ridge) with the elastic net method, in order to maximize the R-squared score and therefore to improve translation accuracy. As a solution to the pre-image problem (decoding process), we used a De Bruijn multi-graph search applied to the target 2-grams. In order to improve the classical decoding process, which uses a language model, we used an LSA-based search method. We conducted a set of experiments to compare our approach with others, i.e., the MT framework Moses and lasso- and ridge-based regression MT.

The remainder of this paper is organized as follows. In Sect. 2, we present related work. Section 3 describes the sign language data. Section 4 presents our approach. Section 5 presents our experiments and the main results we obtained. Finally, Sect. 6 concludes and gives some perspectives.

2 Related work

One of the main problems of machine translation is how to find the most likely translation of a source sentence from a set of training sentences. The goal is to find significant relationships between source and target language. However, due to the complexity of the problem, it is not easy to express these relationships as a set of rules. In other words, there are no general rules that can generate high quality translation for new unknown input sentences.

In fact, the majority of relevant research works are based on statistical MT and regression-based MT. The work of Koehn et al. (2003), Koehn and Hoang (2007) focuses on statistical models whose parameters are derived from the automatic analysis of a set of bilingual phrases. It relies on searching for the highest-probability translation among a number of choices in order to find the most likely translation of an input text. This technique gives good results in terms of translation quality when very large bilingual text corpora are available. The quality of the results is closely related to the size of the parallel training corpora and to the quality of the unsupervised alignment. However, the translation cannot be of high quality when this technique uses a reduced training set, as is the case for sign language corpora.

Fig. 1 The string-to-string mapping

Regression-based MT relies on linear regression, one of the multivariate analysis methods dealing with quantitative data. The main objective of this method is to seek a linear mapping between one or more quantitative source variables and one or more quantitative target variables. In MT, the method deals with the problem of mapping sentences from a source language \(X^{*}\) to a target language \(Y^{*}\). Formally, let X and Y be the token sets used to represent source and target n-grams; a training sample of m input n-grams can then be represented as \((x_{1},y_{1}),\ldots ,(x_{m},y_{m})\in X^{*}\times Y^{*}\), where \((x_{i},y_{i})\) corresponds to a pair of source and target language token strings. Input n-grams in \(X^{*}\) are mapped via \(\varphi _{x}\) to a feature space \(F_{x}\), and output strings are mapped to \(F_{y}\) via the mapping \(\varphi _{y}\). These mappings can be defined implicitly by positive symmetric kernels \(K_{x}\) and \(K_{y}\) associated with \(\varphi _{x}\) and \(\varphi _{y}\). Our goal is to find a mapping f: \(X^{*} \rightarrow Y^{*}\) that converts a given set of source phrases into a set of target phrases sharing the same meaning in the target language. Since the target feature space has several components (\(K>1\)), we face a multiple regression problem, which could be handled by introducing a different set of basis functions for each feature. In other words, a multiple regression technique can be used to learn and estimate the mapping g from \(X^{*}\) to \(F_{y}\), the target string then being recovered through the pre-image \(\varphi _{y}^{-1}\). Figure 1 depicts the scheme of the translation process presented in the work of Cortes et al. (2007).

The common approach, however, is to use the same set of basis functions to model all the target features, leading to the linear model in Eq. (1):

$$\begin{aligned} \varphi _{(y)}=W\varphi _{(x)} \end{aligned}$$
(1)

This problem can be solved by minimizing the sum of squared differences (SSD) in \(\varphi (y)\) over S, where S is a set of bilingual sentence pairs, \(S=\left\{ (x_{i},y_{i}):x_{i}\in X^{*},y_{i}\in Y^{*},i=1\ldots m\right\} \). This solution [Eq. (2)] is known as Ordinary Least Squares (OLS), or the \(L^{2}\) norm solution, and aims to learn the linear operator W in Eq. (1):

$$\begin{aligned} min ||WM{}_{\varphi _{(X)}} -M_{\varphi _{(Y)}}||_{F}^{2} \end{aligned}$$
(2)

where \(M_{\varphi _{(X)}}=[\varphi (x_{1}),\ldots ,\varphi (x_{m})]\) and \(M_{\varphi _{(Y)}}=[\varphi (y_{1}),\ldots ,\varphi (y_{m})]\) are matrices, and \(||.||_{F}\) denotes the Frobenius (matrix) norm, i.e., the square root of the sum of the absolute squares of the elements. The minimal least squares estimator is given by:

$$\begin{aligned} W=(M_{\varphi _{(x)}}^{T}M_{\varphi _{(x)}})^{-1}M_{\varphi _{(x)}}^{T}M_{\varphi _{(y)}} \end{aligned}$$
(3)

Furthermore, the work of Cortes et al. (2007) is based on a regression technique for learning a string-to-string mapping. This approach led to several other studies in the machine translation field, such as the work of Wang et al. Zhuoran et al. (2007). Wang et al. rely on a string-to-string mapping in order to find a linear model using ordinary least squares (OLS) regression and n-gram string kernels on a small subset of the Europarl corpus. They use the pre-image model as a score within standard statistical machine translation systems such as phrase-based search Koehn et al. (2003). However, this approach loses some of the main advantages of the regression approach. In fact, OLS is not necessarily the best estimator: there are cases, for instance when two (or more) predictor features are strongly correlated and increase in a similar way, where the determinant of the matrix \(M_{\varphi _{(x)}}^{T}M_{\varphi _{(x)}}\) is close to zero, which makes it an ill-conditioned matrix.

In other words, the minimal least squares estimator may cause problems related to ill-conditioning or singularity of the matrix. This is caused by similar or duplicated samples that can be found in the training set, yielding a large number of solutions. Consequently, the matrix cannot be inverted with as high a precision as we would like, and the large variance affects the final parameter estimation. As an improvement of this approach, Wang and Shawe-Taylor Zhuoran and Shawe-Taylor (2008) used L2-regularized least squares regression in machine translation.
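This effect can be reproduced with a few lines of Python; the snippet below is only an illustrative sketch (not part of our system) showing how two nearly collinear feature columns drive the condition number of \(M_{\varphi _{(x)}}^{T}M_{\varphi _{(x)}}\) towards infinity, which is what makes the plain OLS inverse of Eq. (3) unreliable.

```python
import numpy as np

# Illustrative sketch: two almost identical predictor columns make the Gram
# matrix M^T M nearly singular, so inverting it (as OLS in Eq. (3) requires)
# amplifies noise in the estimated mapping W.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 1e-6 * rng.normal(size=100)   # near-duplicate feature
M = np.column_stack([x1, x2])

gram = M.T @ M
print("condition number of M^T M:", np.linalg.cond(gram))  # huge => ill-conditioned
```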

This improvement is known as Tikhonov regularization or ridge regression Hoerl and Kennard (1970). This solution gives preference to a particular solution with a smaller norm by including a regularization term in the minimization, as in Eq. (4):

$$\begin{aligned} min ||WM{}_{\varphi _{(X)}} -M_{\varphi _{(Y)}}||_{F}^{2}+\Gamma ||W||_{F}^{2} \end{aligned}$$
(4)

Using this regularization improves the conditioning of the problem and enables a direct numerical solution. An explicit solution is given by:

$$\begin{aligned} W=(M_{\varphi _{(x)}}^{T}M_{\varphi _{(x)}}+\Gamma I)^{-1}M_{\varphi _{(x)}}^{T}M_{\varphi _{(y)}} \end{aligned}$$
(5)

where \(\Gamma \) is the regularization (conditioning) factor, which can be determined by cross-validation, and I is the identity matrix.
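As an illustration, Eq. (5) can be computed directly with NumPy. The sketch below assumes the rows of the matrices hold the per-phrase feature vectors (our actual pipeline may organize the data differently) and uses a linear solve rather than an explicit inverse for numerical stability.

```python
import numpy as np

def ridge_estimator(Mx, My, gamma):
    """Closed-form ridge solution of Eq. (5):
    W = (Mx^T Mx + gamma * I)^(-1) Mx^T My.
    Mx: (n_phrases, n_source_features), My: (n_phrases, n_target_features)."""
    d = Mx.shape[1]
    return np.linalg.solve(Mx.T @ Mx + gamma * np.eye(d), Mx.T @ My)
```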

Although the translation quality they achieved on the Europarl corpus is still not better than that of statistical phrase-based MT (the Moses framework) Koehn et al. (2007), this approach gives better results on small-scale corpora.

The work of Biçici and Yuret (2010) is based on the Feature Decay Algorithm (FDA), a class of instance selection algorithms. They use feature decay in order to increase the diversity of the selected training set by devaluing already included features Biçici and Yuret (2010). They use L1-regularized regression for sparse estimation of the target features and graph decoding to find the translation results.

The lasso, or L1-regularized method, can be useful in some contexts in order to select solutions with fewer non-zero feature values, using the following formulation:

$$\begin{aligned} min ||WM{}_{\varphi _{(X)}}-M_{\varphi _{(Y)}}||_{F}^{2}+\Gamma ||W||_{1} \end{aligned}$$
(6)

The work of Serrano et al. (2009) is based on learning the translation mapping by linear regression, applied to the constrained domain of hotel front desk requests. Once the target feature vector is obtained, they use a multi-graph search to find all possible target strings. We notice that the majority of existing works apply regression or statistical techniques to spoken language corpora. The works of Stein (2012) and Boulares and Jemni (2014) use statistical approaches for sign language machine translation with small-sized corpora. In our previous work Boulares and Jemni (2014), we used an approach based on both kernel regression and statistical MT. This method requires a perfect pre-generated word-to-word alignment to give good results; its disadvantage is that such a perfect word-to-word alignment cannot be generated automatically. Furthermore, the work of Hung-Yu Su Hung-Yu and Chung-Hsien (2009) relies on extracting the thematic relations between the grammar rules of Chinese and Taiwanese Sign Language structures from a small corpus. The extracted thematic role templates are used as a translation memory for statistical machine translation. The disadvantage of this work is that in sign language there are no general rules that can be applied automatically; therefore, the quality of the translation depends on the quality of the extracted rules. The work of Cortes et al. (2007) aims to learn a string-to-string mapping based on ridge regression combined with a language model (LM) and a De Bruijn graph for the decoding process. The work of Wang and Shawe-Taylor Zhuoran et al. (2007) also relies on ridge regression (L2 regularization) to learn the phrase-to-phrase mapping, and uses the De Bruijn graph and the language model in the decoding process. The work of Biçici and Yuret (2010) is based on lasso regression (L1 regularization) to learn the phrase-to-phrase mapping and uses the same decoding process as Wang and Shawe-Taylor Zhuoran et al. (2007). The work of Schmidt et al. (2013) focuses on sign language-to-text translation based on the correspondence between mouthing and spoken language words. It relies on recognizing mouthing from a video of a person signing and mapping it to text in a spoken language. They integrate a recognition and translation framework by adding a viseme recognizer through a lip reading system in order to optimize the recognition system and improve the translation output. Furthermore, we observe that there are no existing studies that rely on regression approaches for sign language machine translation. For this purpose, we are interested in the works of Cortes et al. (2007), Wang and Shawe-Taylor Zhuoran et al. (2007), and Biçici and Yuret (2010), in order to benefit from both statistical and regression approaches.

In this paper, we present a novel approach that consists in using the elastic net regression model (well known as the combination of L1 and L2 regularization) to learn the phrase-to-phrase mapping. For the decoding step, we rely on a multi-graph search through the De Bruijn graph in order to find all possible target words whose mappings correspond to the translated feature vector. In order to find the best translation from a multitude of combinations (paths in the graph), we apply Latent Semantic Analysis (LSA) instead of the LM. We experimented with our approach on our small-scale ASL corpus and obtained good results, as presented in Sect. 5.

3 Sign language data

Linguistic research has shown that American Sign Language has its own internal structure Neidle et al. (2000), Stokoe (1978). The grammatical structure of ASL involves the symbolic meaning of spatial locations and entities, known as "iconicity". In fact, iconicity occurs in both spoken and gestural languages such as sign language. In sign language, information can be conveyed using iconic and non-iconic signs. Iconic signs describe an icon or a picture of some aspect of the thing or activity being symbolized Battison (1978). For instance, in ASL the sign "car" can be symbolized by a standard icon reflecting the word meaning (as shown in Fig. 2a). Additionally, ASL exploits both iconicity and spatial positioning (in front of the signer) to reflect the visual aspect of the information. This specificity leads us to focus on one of the most important sign language parameters, known as the locative expression of sign language.

In fact, American Sign Language uses several different ways to ensure a locative reading. For example, to describe entities in a story, signers may use specific hand-shapes with movement and hand orientation deployed in the signing space. This sign language device is known as "the transfer of form and size". It relies on classifier predicates (CP) to symbolize the discourse entities in space. Indeed, classifier predicates are considered part of lexical signs. They consist of a hand-shape configuration accompanied by location, palm orientation, movement, and non-manual signals. The different relationships between entities are indicated by the movement shape and the spatial positioning of the appropriate classifier in the signing space. As shown in Fig. 3, the signer symbolizes the action "bites" by using the "claw" hand-shape moving from one entity location to another. Classifier predicates rely on several symbols to characterize each class of entity. For example, the 3 hand-shape classifier can be used to describe several objects such as "CAR", "BOAT" and "BICYCLE"; in other words, the 3 hand-shape is a classifier symbolizing the class of "vehicle" objects. The classifier (CL) with the F hand-shape represents small round things such as buttons and tokens, and the CL V hand-shape is used to describe legs, a person walking, etc.

Fig. 2 a An overview of the iconic sign "car"; b an example of a phonological contrast in ASL. These signs differ only in the location of their articulation

Fig. 3 An example of the sentential use of space in ASL. Nominals (cat, dog) are first associated with spatial loci through indexation. The direction of the movement of the verb (BITE) indicates the grammatical roles of subject and object

Furthermore, the spatial positions associated with referents can also convey locative information about them. For example, the phrase "the dog index", shown in Fig. 3, could be interpreted as "the dog is there on my left". The signer adds the sign "index" in order to establish a reference relation between the dog and a spatial location. Signers may add a specific facial expression (e.g., spread tight lips with eye gaze toward the locus) produced simultaneously with the index sign or with classifier predicates Valli and Lucas (2000). The locative expression in sign language can also be expressed by locative verbs, which exploit the movement direction of the action in order to indicate the location of the entity in space. For example, in the sentence "john throw rock", the direction of the movement of the verb indicates the direction in which the object is thrown. Signers can also make reference to absolute locations, as when they use the signs for "east", "west", "north" and "south" Valli and Lucas (2000).

Fig. 4 An example of the transformation of an English phrase into ASL form including signing space information

Emmorey (2005) has shown that location is a part of all ASL signs and that signers use location in many different ways. Some signs use a body location: the ASL sign "bored" is based on the head location (the nose), the sign "feel" uses the chest, and the sign "Russian" uses the waist. Signers can also use the signing space surrounding them to indicate that a sign is in front of them, on the left or on the right. According to Sandler (1989), location is a crucial parameter that removes the semantic ambiguity of some signs, such as the ASL signs "summer", "ugly" and "dry" shown in Fig. 2b: all of these signs are articulated in the same manner and differ only in where they are produced on the body. The study of Huenerfauth and Lu (2011) has shown that signing space information, a phenomenon in which signers use special hand movements to indicate the location and movement of invisible objects (representing the entities under discussion) in the space around their bodies (as shown in Fig. 3), improves translation understanding. In this example, the signer uses the sign "INDEX" to place the signs "DOG" and "CAT" so that they can be used in the action "BITE". Signing space information is frequent in ASL and is necessary for conveying many concepts. Therefore, a translation process that integrates spatial information is more understandable. It is also very helpful for deaf people, who may have difficulties in creating a mental image Charles and Rebecca (2000) reflecting the true meaning.

Furthermore, in order to be able to transcribe this gestural language, we opted for ASL glossing to represent signs in text form. This representation differs from writing in a spoken language because, when glossing, the target language may not have the same word order as the original language. This means that the English representation needs to be translated into an ASL glossing form that includes signing space information. Consequently, this problem is well suited to machine translation technology. However, due to the lack of parallel sign language corpora that include English and ASL glosses, we built a parallel corpus of 300 English text phrases and their translations in ASL glossing form. We applied a manual pre-treatment in order to preserve only the useful words in the English phrases (those that will be translated into ASL glosses). For example, the English phrase "Let we go to the restaurant" is transformed into "LET-GO YOU I RESTAURANT": the words "let" and "go" are joined to describe a single sign, the word "we" is transformed into "you i", and "to" is removed. Likewise, the sentence "The dog bites the cat" is reduced to "DOG BITE CAT". The ASL glossing form used in this work includes signing space information for the main entities of the initial English phrases, as shown in Fig. 4.

In the examples of Fig. 4, the word order is changed according to ASL glossing Valli and Lucas (2000). For instance, in "DOG BITE CAT" the entities "DOG" and "CAT" are placed at the beginning of the sentence and there is no preferential order between them ("DOG" "CAT" or "CAT" "DOG" are equivalent). Afterwards, the location information is added to each word using the sign "INDEX" followed by the location relative to the signer (on the left, on the right, etc.). Then, this location information is used with the action (which should be placed at the end Valli and Lucas (2000)) in order to refer to the pointed entities. As shown in Fig. 4, for the ASL phrase "CAT SIT BELOW TABLE", the signer uses the passive hand with the Flat-B classifier (palm facing down) to symbolize the table. He then uses the dominant hand for the sign "CAT". The action "SIT BELOW" is described by placing the BENT-2 classifier below the table location; here, the BENT-2 classifier symbolizes the meaning of "CAT SIT". For the ASL phrase "CAT SIT ON TABLE", the location of the BENT-2 classifier changes to be above the table location. In this example, we notice that the passive hand can be used as a location referent for the dominant hand.

Table 1 summarizes the ASL corpus information. The training set is composed of 300 parallel sentences containing 3 to 6 words per sentence, and the test set includes 100 different phrases. The corpus phrases mainly follow a simple SOV (Subject Object Verb) structure. In these phrases, we relied on classifier predicate locations, INDEX references and absolute locations ("east", "west", "north", "south") to describe the "locative expression" of ASL. The vocabulary size is 1014 for the source phrases and 966 for the target phrases. For the feature mapping (see Sect. 4), we generate 557 source 2-grams and 554 target 2-grams that we use to train the regression function.

Table 1 Our corpus details

4 Our approach

Our translation approach relies mainly on three steps. The first is the feature mapping process, which consists in transforming the input data (phrases) into feature vectors in an m-dimensional space; the purpose of this transformation is to facilitate the automatic analysis of textual data. The second step learns the translation mapping between source 2-grams and target 2-grams; the result of this step is a linear regression function representing the translation mapping. The third and final step is the pre-image resolution, which consists in determining the predicted output for an input phrase.

4.1 Feature mapping of textual data

The automatic data analysis field requires explicit feature vectors of the input data in order to derive useful information for prediction. However, there are many cases where the input data cannot be described by explicit feature vectors, such as biological sequences, images, graphs and text documents. For such data sets, the construction of a feature extraction module can be as complex and expensive as solving the entire problem Lodhi et al. (2002). It is also possible to lose some important information during the feature extraction process. In other words, the effectiveness of a system is closely related to the accuracy and performance of the feature extraction process. For this purpose, we introduce kernel methods, which can be considered an efficient alternative for feature extraction.

The feature mapping used by kernel methods, especially string kernels, can be directly used by learning techniques in order to predict new data. In general, the most natural and efficient way to compare two phrases is to count the contiguous n-grams they have in common. Comparing the n-spectra of two strings can give important information about their similarity, especially in the machine translation field, where contiguity is a crucial parameter for translation accuracy. For this reason, we adopted the n-spectrum weighted word kernel Shawe-Taylor and Cristianini (2004) as a feature mapping technique applied to 2-gram sequences. The feature mapping is defined by:

$$\begin{aligned} k_{n}(p,f)=\sum _{i=1}^{|p|-n+1}\sum _{j=1}^{|f|-n+1}k_{n}(p(i:i+n),f(j:j+n)) \end{aligned}$$
(7)

where p represents the phrase, f is the feature and n is the length of contiguous words (grams). The computation of the n-spectrum kernel feature mapping requires O(n|p||f|) operations.
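For concreteness, the following sketch shows a word-level 2-gram (n = 2) feature map of this kind; the vocabulary list and example phrase are hypothetical, and the real implementation may use the kernel form of Eq. (7) directly.

```python
from collections import Counter

def two_gram_features(phrase, vocabulary):
    """Count-based 2-gram feature vector of a phrase over a fixed 2-gram
    vocabulary (one dimension per training-corpus 2-gram)."""
    words = phrase.split()
    grams = Counter(zip(words, words[1:]))          # adjacent word pairs
    return [grams.get(g, 0) for g in vocabulary]

# Hypothetical source vocabulary and phrase
src_vocab = [("the", "dog"), ("dog", "bites"), ("bites", "the"), ("the", "cat")]
print(two_gram_features("the dog bites the cat", src_vocab))   # [1, 1, 1, 1]
```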

4.2 The regression function

The regression technique aims to learn a model that fits the data well. The quality and accuracy of predictions on unknown data are crucial in the machine translation field. However, the minimal least squares estimator described by formula (3) is well known to be a poor prediction and interpretation model Zou and Hastie (2005), Bishop (2006). Regularized versions of OLS have therefore been proposed in order to improve the regression accuracy, such as L2-norm based regression, known as ridge regression, and L1-norm based regression, known as lasso regression.

Ridge regression is known as a solution to multicollinearity in the data: it adds a degree of bias to the regression estimates in order to reduce the standard errors. Multicollinearity causes a large variance in the OLS model, which makes the estimates far from the true values. The existence of near-linear relationships in the data causes multicollinearity and therefore decreases both the accuracy of the regression coefficients and the predictive power of the model. Moreover, according to Zou and Hastie (2005), ridge regression cannot produce a parsimonious model, i.e., a model constrained to estimate a small number of parameters: it always keeps all the predictors in the model. Biçici and Yuret (2010) have shown that, with the L2-norm, the obtained model cannot generate a sparse solution and the majority of the coefficients remain non-zero, which makes the machine translation decoding process more complicated.

Fig. 5 \(R^{2}\) values of lasso, ridge and elastic net for different training corpus sizes

The L1-norm regularization proposed by Tibshirani (1996) provides a sparse model by imposing an L1 penalty on the regression coefficients. The L1-norm approach performs shrinkage and automatic variable selection simultaneously in order to reduce the coefficient values. Zou and Hastie (2005) have shown that when the number of variables is greater than the number of observations n, the L1-norm selects at most n variables before saturating, which is a drawback in terms of variable selection. Also, when variables are highly correlated pairwise, the L1-norm randomly selects only one variable of each group. On the other hand, when the predictors are highly correlated, the prediction performance of the L1 approach is empirically dominated by L2 regularization Tibshirani (1996). Furthermore, based on the study of Tibshirani (1996), neither ridge nor lasso regression uniformly dominates the other in terms of prediction performance. In our numerical experience, as shown in Fig. 5, the relative regression performance, measured by the \(R^{2}\) values of lasso and ridge, changes according to the corpus size. If we train the lasso and ridge regression methods on a corpus whose size varies between 50 and 100 phrases, ridge is better than lasso; from 150 to 250 phrases, lasso performs better than ridge; and for 300 phrases, ridge again becomes better than lasso. This numerical analysis is consistent with the absence of uniform domination between ridge and lasso discussed above.

Zou and Hastie (2005), Hastie et al. (2006), Trevor et al. (2009) proposed a regularization technique, called the elastic net, as an improvement of the lasso. The elastic net aims to overcome the lasso problems cited above by relying on automatic variable selection, continuous shrinkage and grouped selection of correlated variables. Equation (8) is the formal description of the elastic net, which contains L1 and L2 (quadratic) parts. The L1 part of the penalty generates a sparse model, while the quadratic part removes the limitation on the number of selected variables. Consequently, elastic net regularization encourages a grouping effect and stabilizes the L1 regularization path. Real data examples also show that the elastic net often outperforms the lasso and ridge in terms of prediction accuracy, as shown in Fig. 5. In fact, relying on our numerical analysis, we conclude that the elastic net outperforms both lasso and ridge regression in terms of prediction accuracy \(R^{2}\) and Mean Squared Error (MSE) regardless of the corpus size (see Figs. 5, 6). For this reason, we adopted the elastic net as the regression function for our machine translation process.

Fig. 6 MSE comparison of the lasso, ridge and elastic net regression functions with training set sizes from 100 to 300 phrases

The elastic net technique is based mainly on the combination of the L1 and L2 penalties:

$$\begin{aligned} L(\Gamma _{1},\Gamma _{2},W)=||WM{}_{\varphi _{(X)}}-M_{\varphi _{(Y)}}||^{2}+\Gamma _{2}||W||^{2}+\Gamma _{1}||W||_{1} \end{aligned}$$
(8)

where \(||W||^{2}=\sum _{j=1}^{f}W_{j}^{2}\) and \(||W||_{1}=\sum _{j=1}^{f}|W_{j}|\), with f the number of features and \(j=1\ldots f\). By minimizing Eq. (8), the naive elastic net estimator becomes:

$$\begin{aligned} \hat{W}=argmin_{W}L(\Gamma _{1},\Gamma _{2},W) \end{aligned}$$
(9)

where \(L_{1_{ratio}}=\frac{\Gamma _{2}}{\Gamma _{1}+\Gamma _{2}}\). Solving for \(\hat{W}\) in Eq. (8) is equivalent to the optimization problem:

$$\begin{aligned} \hat{W}=argmin_{w}||WM{}_{\varphi _{(X)}}-M_{\varphi _{(Y)}}||^{2}+(1-L_{1_{ratio}})||W||_{1}+L_{1_{ratio}}||W||^{2} \end{aligned}$$
(10)

where \((1-L_{1_{ratio}})||W||_{1}\) and \(L_{1_{ratio}}||W||^{2}\) are respectively the lasso and ridge penalties that form the elastic net penalty. When \(L_{1_{ratio}}=1\) the estimator becomes ridge, when \(L_{1_{ratio}}=0\) it becomes lasso, and if \(0<L_{1_{ratio}}<1\) the penalty is a combination of \(L_{1}\) and \(L_{2}\). As mentioned in Zou and Hastie (2005), the naive version of the elastic net method first finds the ridge regression coefficients for a fixed \(\Gamma _{2}\) and then performs a lasso shrinkage to generate a sparse model. The quadratic part of the penalty removes the limitation on the number of selected variables, leads to a grouping effect and stabilizes the L1 regularization path. In Zou and Hastie (2005), the authors presented an efficient solution to the naive elastic net problem:

$$\begin{aligned} \hat{W}=argmin_{W}W^{T}\left( \frac{M{}_{\varphi _{(X)}}^{T}M{}_{\varphi _{(X)}}+\Gamma _{2}I}{1+\Gamma _{2}}\right) W-2M_{\varphi _{(Y)}}^{T}M_{\varphi _{(x)}}W+\Gamma _{1}||W||_{1} \end{aligned}$$
(11)
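To make the training step concrete, the sketch below fits the elastic net mapping with scikit-learn; the file names and hyper-parameter values are placeholders, and note that scikit-learn's l1_ratio denotes the L1 fraction of the penalty, i.e., the converse of the \(L_{1_{ratio}}\) convention used above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical feature matrices: one row per parallel phrase, one column per
# source (X) or target (Y) 2-gram, built with the feature map of Sect. 4.1.
X = np.load("source_2gram_features.npy")
Y = np.load("target_2gram_features.npy")

# ElasticNet accepts a multi-column target, fitting one row of W per target
# 2-gram; alpha and l1_ratio are illustrative values, not the tuned ones.
model = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)
model.fit(X, Y)
print("training R^2:", model.score(X, Y))

# Predicted target feature vector for a new source phrase (input to decoding)
phi_y_hat = model.predict(X[:1])
```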

4.3 The decoding problem

4.3.1 De Bruijn graph

The decoding problem, known as the pre-image problem, aims to find the target sentence Y from the feature vector \(\varphi _{(y)}\) predicted by Eq. (1). We rely on Eq. (11) to compute the estimator \(\hat{W}\), which is then used to find the new feature values through Eq. (12):

$$\begin{aligned} Y=\varphi _{(y)}^{-1}=argmin||\hat{W}\varphi _{(x)}-\varphi _{(y)}||^{2} \end{aligned}$$
(12)

Fig. 7 An overview of our pre-image solution

The obtained vector values are rounded in order to obtain integer counts, as in Cortes et al. (2007). The pre-image solution is then achieved by building a De Bruijn graph from the non-zero feature values, which connects the 2-gram features in a single graph. This technique seeks all possible paths between nodes and therefore identifies all the possible translations, as shown in Fig. 7.
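A minimal sketch of this graph search is given below, under the assumption that the predicted target 2-grams are word pairs with non-zero rounded counts; each candidate translation corresponds to a path that traverses every predicted 2-gram edge once.

```python
from collections import defaultdict

def debruijn_candidates(predicted_2grams):
    """Enumerate candidate translations from predicted target 2-grams.
    Nodes are words, each 2-gram (a, b) is a directed edge a -> b, and a
    candidate is a path that uses every edge exactly once."""
    edges = list(predicted_2grams)
    graph = defaultdict(list)
    for i, (a, b) in enumerate(edges):
        graph[a].append((i, b))

    candidates = []

    def walk(node, used, path):
        if len(used) == len(edges):
            candidates.append(" ".join(path))
            return
        for i, nxt in graph[node]:
            if i not in used:
                walk(nxt, used | {i}, path + [nxt])

    for start, _ in edges:
        walk(start, frozenset(), [start])
    return candidates

# Example with 2-grams predicted for a short gloss
print(debruijn_candidates([("DOG", "INDEX"), ("INDEX", "BITE"), ("BITE", "CAT")]))
# -> ['DOG INDEX BITE CAT']
```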

4.3.2 LSA search

Latent Semantic Analysis (LSA) is a statistical technique that aims to discover hidden concepts in order to extract relationships between documents. Each document and term (word) is represented by a vector whose elements expose document-to-document similarities or semantic relationships; each element represents the degree of participation of the document or term in the corresponding concept. LSA was introduced as an information retrieval method and is sometimes called Latent Semantic Indexing (LSI) Deerwester et al. (1990). In the works of Biçici and Yuret (2010) and Zhuoran et al. (2007), the automatic translation selection process was based on a language model to select the most appropriate translation among the set of possible translations generated from the De Bruijn graph. In fact, as shown in Sect. 5, using LSA in the pre-image step, precisely in the translation selection process, improves the translation results in terms of accuracy and quality. This improvement is due to the semantic aspect of the LSA technique, which describes the semantic similarities between the different possible translations generated from the De Bruijn graph and those in the corpus. In other words, our goal is to find the most similar translation between the De Bruijn set of translations and our corpus translations through the following steps:

Step 1: We create a vector for each target phrase in our corpus over all the 2-gram terms, using formula (7), in order to obtain an \(n\times m\) document-term matrix. Formally, let A be the \(n\times m\) document-term matrix of the document collection: each column of A corresponds to a 2-gram term, and the dimensions m and n correspond respectively to the number of 2-gram terms and the number of documents in the collection. We apply formula (7) to weight all the elements of the matrix.

Step 2: We perform a dimension reduction through Singular Value Decomposition (SVD) De Lathauwer et al. (2000) on A as follows:

$$\begin{aligned} M=USV^{T} \end{aligned}$$
(13)

where \(U{}^{T}U=I\) and \(V{}^{T}V=I\); the columns of U are orthonormal eigenvectors of \(AA^{T}\), the columns of V are orthonormal eigenvectors of \(A^{T}A\), and S is a diagonal matrix containing the square roots of the eigenvalues of \(AA^{T}\) (equivalently \(A^{T}A\)) in descending order.

Step 3: This step aims to find the most similar translation among those generated by the De Bruijn graph. For each unknown translation T generated from the De Bruijn paths, we apply step 1 to T in order to obtain a vector K, and we perform an SVD with the same dimension reduction parameter on K to obtain \(K'\). To extract the most similar translation to our target corpus phrases, we use the following formula:

$$\begin{aligned} Tr_{phrase}=argmax(K'M^{T}) \end{aligned}$$
(14)

We repeat step 3 for each generated De Bruijn translation in order to obtain a vector of \(Tr_{phrase}\) scores, denoted \(Vect_{Tr_{phrase}}\). By applying formula (15), we obtain the most similar and appropriate translation, as shown in Fig. 7:

$$\begin{aligned} Phrase_{Tr}=argmax(Vect_{Tr_{phrase}}) \end{aligned}$$
(15)
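The sketch below illustrates this LSA-based decision rule with scikit-learn's truncated SVD; projecting the candidate vectors with the SVD fitted on the corpus matrix is a simplification of steps 1-3 (our exact matrix construction and dimension are described above), and the function name is hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def lsa_select(candidate_vectors, corpus_vectors, n_components=250):
    """Pick the De Bruijn candidate most similar to some corpus translation.
    corpus_vectors: (n_target_phrases, n_2grams) 2-gram count matrix (step 1);
    candidate_vectors: one row per De Bruijn candidate.
    Returns the index of the selected candidate (Eqs. (14)-(15))."""
    k = min(n_components, corpus_vectors.shape[1] - 1)
    svd = TruncatedSVD(n_components=k)
    M = svd.fit_transform(corpus_vectors)     # step 2: reduced corpus matrix
    K = svd.transform(candidate_vectors)      # step 3: candidates in LSA space
    scores = (K @ M.T).max(axis=1)            # Eq. (14) for each candidate
    return int(np.argmax(scores))             # Eq. (15)
```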

5 Experimental study

In order to validate our approach, we conducted an experimental study. As mentioned above, we built a corpus of 300 parallel phrases and used it as the framework of our experiments. First, we applied the three techniques, ridge, lasso and elastic net, to our corpus and compared the results. We analyzed, in particular, the regression function results and the MT evaluation reports. Furthermore, we used both LM and LSA in the decoding process. In the next two subsections, we present the main findings we obtained.

5.1 The regression function

We conducted a detailed experimental study of the elastic net regression function used in our solution. Figures 8 and 9 show an experimental comparison of the elastic net fitting accuracy, in terms of \(R^{2}\) and MSE respectively, for different values of the L1 ratio and training set sizes from 100 to 300 phrases. The variation of the training set size and of the L1 ratio penalty (which reflects the relative weight of the L1 and L2 norms in the elastic net regression function) clearly affects the \(R^{2}\) fitting accuracy.

As shown in Fig. 8, for a training size of 100 phrases, L1_ratio \(=\) 0.1 has the best fitting value (\(R^{2}\)) compared to L1_ratio values from 0.2 to 0.9. From Fig. 9, we also notice that for 100 training phrases, L1_ratio \(=\) 0.1 has the minimum mean squared error (MSE). For 300 phrases, L1_ratio \(=\) 0.9 has the highest fitting value (\(R^{2}\) score) and the minimum MSE. With 100 training phrases and \(L1_{ratio_{Elastic}}=0.1\):

  • \(R_{Elastic}^{2}(0.28)>R_{Ridge}^{2}(0.19)>R_{Lasso}^{2}(0.11)\).

  • \({\textit{MSE}}_{Elastic}(0.0060)<{\textit{MSE}}{}_{Ridge}(0.0068)<{\textit{MSE}}{}_{Lasso}(0.0074)\).

From these experiments, we deduce that the elastic net regression function improves the fit to the training data, thanks to the combination of the advantages of both the lasso and ridge regression functions.

Fig. 8 Experimental comparison of elastic net fitting accuracy in terms of \(R^{2}\) for different values of the L1 ratio and training set sizes

Fig. 9 Experimental comparison of elastic net fitting accuracy in terms of mean squared error (MSE) for different values of the L1 ratio and training set sizes

5.2 Machine translation evaluation

We recall that we used a corpus containing 300 parallel phrases from English text to ASL (including signing space), as well as 100 different parallel test phrases. As mentioned in Sect. 3, the reduced size of our corpus is due to the lack of sign language corpora.

Fig. 10 BLEU scores using elastic net and LSA search for varying LSA dimension and \(L1_{ratio}\)

As shown in Fig. 10, our experimental results indicate that the BLEU score varies with two parameters: the LSA dimension used in the translation selection process and the \(L1_{ratio}\) of the elastic net regression function. Furthermore, it is clear that the highest BLEU score is obtained when \(LSA_{dimension}=250\). This LSA dimension (250) is the highest dimension that can be used experimentally for a corpus of 300 phrases, based on the scikit-learn Python library (singular value decomposition, SVD).
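For reference, the sketch below shows one way to score such a parameter grid with NLTK's corpus-level BLEU; the grid values and the translate_test_set helper are placeholders, and our reported scores may come from a different BLEU implementation.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_percent(hypotheses, references):
    """Corpus BLEU (in percent) with one reference per test phrase."""
    refs = [[r.split()] for r in references]
    hyps = [h.split() for h in hypotheses]
    return 100 * corpus_bleu(refs, hyps,
                             smoothing_function=SmoothingFunction().method1)

# Hypothetical evaluation grid over the elastic net L1 ratio and LSA dimension:
# for l1_ratio in (0.1, 0.3, 0.5, 0.7, 0.9):
#     for lsa_dim in (50, 100, 150, 200, 250):
#         hyps = translate_test_set(l1_ratio, lsa_dim)   # placeholder pipeline call
#         print(l1_ratio, lsa_dim, bleu_percent(hyps, test_references))
```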

Furthermore, Figs. 11 and 12 also show that, with \(LSA_{dimension}=250\) in the translation selection process, our machine translation approach reaches its highest METEOR Banerjee and Lavie (2005) and NIST Doddington (2002), Przybocki (2004) scores.

Fig. 11 METEOR scores using elastic net and LSA search for varying LSA dimension and \(L1_{ratio}\)

Fig. 12 NIST scores using elastic net and LSA search for varying LSA dimension and \(L1_{ratio}\)

Fig. 13 BLEU score comparison of Zhuoran et al. (2007), Biçici and Yuret (2010) and our approach for corpus sizes varying from 50 to 300 phrases using 2-grams

Table 2 A comparison of machine translation approaches based on MT evaluation metrics, experimented on 2-grams with corpora of 50 to 250 phrases
Table 3 A comparison of machine translation approaches based on MT evaluation metrics, experimented on 2-grams with a 300-phrase corpus

We also conducted an experiment based on corpus size variation. As shown in Fig. 13 and Tables 2 and 3, except around a corpus size of 100, our solution performs better than the approach of Zhuoran and Shawe-Taylor (2008) and that of Biçici and Yuret (2010) in terms of the BLEU, METEOR, NIST and F1-measure metrics. It is also clear in Fig. 13 that from 65 to 130 phrases the BLEU score of Wang's approach is higher than ours. The performance of our approach becomes stable and better than Wang's approach when the size of the corpus reaches and exceeds 130 phrases.


Table 4 Experimental comparison of elastic net based MT using both a language model (LM) and an LSA-based decision rule
Fig. 14 A comparison of BLEU scores of ridge, lasso and elastic net with LM and LSA decoding

Table 5 MT evaluation report of Lasso, Ridge and Enet with both LM and LSA decoding

We also experimented with the lasso, ridge and elastic net regression methods using two different decoding processes. In the first, we used the classical language model with the De Bruijn graph; in the second, we used LSA as the decision rule with the De Bruijn graph.

As shown in Table 4, we compared the MT evaluation reports of elastic net with LM decoding and elastic net with LSA decoding, and we obtained the best translation results (in terms of the BLEU, NIST, METEOR and F1-measure metrics) using LSA as the decision rule with the De Bruijn graph. We also compared the MT evaluation reports of lasso, ridge and elastic net with both LM and LSA in the decoding process (see Fig. 14; Table 5): for all three regression methods, the BLEU, METEOR, F1-measure and NIST scores improve when LSA is used instead of the LM decision rule. In other words, the translation quality and accuracy obtained with LSA and the De Bruijn graph in the decoding process are higher than with the classical LM and the De Bruijn graph. This improvement is explained by the fact that a 2-gram language model requires very large training data (contiguous 2-gram frequencies) to reach optimal performance, whereas LSA is based on principal component decomposition, which is less sensitive to small training data sizes than the LM method, as our experiments confirm.

In summary, based on our ASL corpus (300 phrases), we compared the performance of our machine translation approach with the works of Zhuoran et al. (2007), Biçici and Yuret (2010) and Koehn et al. (2007). We conclude that our approach improves translation accuracy and quality in terms of the BLEU Papineni et al. (2002), Denoual and Lepage (2005), NIST, METEOR and F1-measure Powers (2011), Lavie et al. (2004) scores (see Table 2). We noticed that the performance of our machine translation approach in terms of these metrics Coughlin (2003), Doddington (2002), Finch et al. (2005) becomes stable from 130 phrases onwards (as shown in Fig. 13; Tables 2, 3). For the 300-phrase parallel corpus, with \(L1_{ratio}=0.5\) and \({\textit{LSA}}_{dimension}=250\), we achieved good results (as shown in Table 3), with \({\textit{BLEU}}=28.03\), \({\textit{METEOR}}=0.2527\), \({\textit{NIST}}=6.0384\) and \({\textit{F1}}-Measure=0.5495\). Finally, according to these different experiments and with the corpus we used, we conclude that our approach generally gives better results than Biçici's lasso, Wang's ridge and phrase-based MT.

6 Conclusion

In this paper, we presented a novel approach for sign language machine translation based on elastic net regression and LSA search. We created a sign language parallel corpus of 300 phrases that includes signing space specificities. We used the n-spectrum weighted word kernel as a feature mapping technique applied to 2-gram sequences. In the training process, we showed that the elastic net regression function performs better than the lasso and ridge regression functions. In addition, we combined the De Bruijn graph with Latent Semantic Analysis in order to improve on the LM decoding process, and therefore to improve translation quality and accuracy. The experimental study confirmed the advantage of our approach, as we obtained good results in terms of the well-known BLEU, METEOR, NIST and F1-measure metrics. Generally, our approach gives good results when applied to a reduced domain with a simple ASL Subject Object Verb form (short phrase lengths). The ultimate goal of our future research is to increase the size of our sign language corpus in order to cover all possible ASL vocabulary and consequently improve the translation quality.