1 Introduction

According to the 2010 Brazilian population census [29], 5.1% of the population of Brazil—about 9.7 million citizens—have some kind of permanent hearing loss (0.3 million people declared being unable to hear; 1.8 million declared severe hearing difficulty even when using a hearing aid; and 7.6 million declared some hearing difficulty even when using a hearing aid). The 2010 Brazilian census also indicates that 30% of those 9.7 million are illiterate and there are no official statistics regarding the number of citizens that are proficient in Brazilian Sign Language (Libras).

According to the census methodology, an illiterate person is one who cannot read and write in Portuguese, or can do so only with extreme difficulty. One reason for the high illiteracy rate among deaf people is the high rate of school failure, which is not due to any biological characteristic of deafness but to the countless obstacles deaf students encounter in the educational process in Brazilian schools. These schools still lack teachers skilled in bilingual education, instructional materials, and even interpreters whose role is to translate the content of the school subjects into Libras [25, 58].

Bilingual education for deaf people, which comprises Libras as their first language and the written mode of Portuguese as their second language, is a civil right established by decree 5.626 and supported by the National Education Plan [8]. However, this type of education has not yet been fully put into practice.

In their research, Silva et al. [58] concluded that most of the instructional materials produced and adopted in schools are designed for and developed toward students who can hear and whose first language is Portuguese. This practice harms the educational process of deaf students. The authors also highlighted the importance of a multiliteracy approach for the education of deaf students. This approach can facilitate the participation of deaf students in classroom activities as well as their understanding of the school subjects, which helps them better learn how the Portuguese language works. Silva and Kumada [57] argue that Brazilian schools need to change in order to better accommodate deaf students.

Currently, tools available through information and communication technologies and the Internet support the reading, production, and distribution of text in which different kinds of semiosis work in the making of meaning. Rojo [53], for example, highlights how these new technologies allow for an intensive textual hybridization, which can include written, oral, and imagetic language, sometimes all at once, in gadgets and interfaces through hypertexts and hypermedia. These tools make it possible for students to try new and different literacy praxes.

In light of this new reality and with the goal of making schools more accessible to deaf students, some teachers have already adopted Assistive Technologies. Among these, animated avatars that automatically translate words or sentences from Portuguese into Libras are being used as a strategy for easing content learning and the interaction between deaf and hearing students [16]. However, it is necessary to emphasize that the avatar-based systems currently available in Brazil are still geared toward general vocabulary, guided by ordinary, commonly used lexemes. Thus, teachers of specific subjects such as mathematics, geography, and the sciences face not only a lack of educational material for deaf people but also a lack of field glossaries and/or technical dictionaries containing the specific Libras terms of each school subject.

In line with new digital literacy trends [53], coupled with bilingual education (which considers and respects Libras as the deaf students’ first language and Portuguese as their second language), we emphasize the importance of technological appropriation as an ally in the development of accessible instructional materials for deaf and hearing students.

The work described in this paper involves the scientific and technical collaboration between the Center for Rehabilitation Research Studies, the School of Electrical and Computer Engineering, and the Institute of Language Studies of the University of Campinas, as well as the Institute of Computing of the Federal University of Alagoas. The collaboration between these research groups with different domains of expertise, including Linguistics, Social Science, Pedagogy, and Computer Science, aims to develop an automatic Brazilian Portuguese-to-Libras translation system capable of presenting signed information by means of an animated virtual human, or avatar. The system is being designed especially for translating elementary school textbooks. Figure 1 illustrates the target application envisioned.

Fig. 1 Illustrative representation of the target application

In the figure, the text marked by the user is presented by an animated 3D avatar in a separate window that can be moved and resized. The objective is to provide an attractive way for deaf and hard-of-hearing students to experience written and signed content presented side by side on a computer screen. This would allow them to follow the written material as easily as their hearing classmates. It is anticipated that the use of this technology in bilingual educational environments will improve the motivation and performance of deaf students. With a proper pedagogical strategy, the technology can be applied to activities in the classroom and also at home, giving students more flexibility and time to process and convert information into knowledge. In addition to the educational application, automatic translation into Libras can also assist the deaf and hard-of-hearing in accessing the Internet, thus opening up to them a whole universe of written information.

This paper presents an overview of the translation system under development. The remainder of the paper is organized as follows: Sect. 2 surveys prior efforts to develop automatic translation systems from written text to sign language; Sect. 3 describes our approach; Sect. 4 presents the intelligibility evaluation; and Sect. 5 discusses the positive and negative implications of some project decisions already made and the challenges yet to be faced.

2 Related work

The automatic translation of written text into sign language without human intervention is a machine translation (MT) problem. Since sign language is a visual language based on manual, facial, and body movements, the result of the translation is typically presented by a signing avatar. A signing avatar is an animated 3D model of a virtual human that presents messages in sign language.

Considering the different translation strategies found in the literature, text-to-sign-language (TTSL) systems can be classified as:

  • Rule-based systems, which rely on different levels of linguistic translation rules;

  • Data-driven (or corpus-based) systems, which build computational knowledge from examples [64] or from statistical models derived from large bilingual parallel corpora [30];

  • Hybrid systems, which combine aspects of the rule-based systems with the data-driven approach [51].

Early MT initiatives for sign language were based on sets of rules derived from a deep knowledge of the lexical and grammatical structures and the contrastive characteristics of the source and target languages.

Examples of systems that adopted the rule-based approach are ZARDOZ [70], TEAM [74], the Polish Text-into-Gesture Translator (TGT) to Polish sign language (PJM) [65], the South African Sign Language Machine Translation (SASL-ML) project [69], and the many variants developed under the ViSiCAST and eSIGN European projects [21]. More recently, Porta and colleagues proposed a rule-based MT system from Spanish to Spanish Sign Language (LSE) [50]. In the work, the authors highlight that the rule-based approach is advantageous for sign languages since, from a natural language processing point of view, they are still under-resourced and low-density languages in terms of corpora and lexicons. Moreover, the rule-based translation strategy can be considered domain independent.

However, from another point of view, sign languages also impose important challenges to the automation of translation by rules. First, the linguistic knowledge of a certain sign language may be limited, which prevents the definition of robust translation rules from source to target language. This is the case, for example, of sign languages that just recently have been documented and those that are considered emerging sign languages [36, 44, 55, 66].

Second, sign languages comprise visual communication mechanisms that are not modeled by simple syntactic or semantic analyses. This is the case, for example, of Classifier Predicates (CP). CPs are sign language mechanisms used to visually describe scenarios and sequences of events. For example, in the sentence, “the boy climbed the tree and fell,” the signer may use hand and arm configurations to represent a tree in the space and use the “boy” sign to simulate that he is climbing the tree and then falling. Without abandoning the rule-based paradigm, Huenerfauth [28] proposes the modeling of CPs through the association of lexicalized syntactic structures with 3D spatial directives for the signing avatar, at the cost of limiting the domain of the rules.

The technological evolution that resulted in the increase in computational power, inexpensive storage, and greater access to motion capture equipment leveraged the building of parallel corpora for sign languages and the exploration of data-driven MT approaches. The data-driven approaches can be divided into two main strategies: Example-Based Machine Translation (EBMT) and Statistical Machine Translation (SMT).

In analogy to humans, EBMT systems construct translations of new unseen content in the source language by making use of previously seen translation examples, rather than by performing a deep linguistic analysis. For sign languages, a database of examples is typically constructed through the annotation of a video of a sign language interpreter. Glosses, sometimes accompanied by interrogative or negative markers, are frequently used to annotate the signs presented by the interpreter. The corresponding annotation of the video in the target language provides the missing information to build a database of examples. Examples of EBMT systems for sign language are the English into Sign Language of the Netherlands (NGT) [45], the Arabic into Arabic Sign Language (ArSL) described in [1], and the automatic translation system into JSL (Japanese Sign Language) [67].

In SMT systems, a translation model is typically trained over a large bilingual sentence-aligned corpus. The model is capable of determining translation probabilities based on metrics derived from the corpus, such as frequencies of word/gloss co-occurrences, relative source and target word/gloss positions and sentence lengths. Among the initiatives to implement SMT systems into sign language, we highlight the works from Bungeroth and Ney [11], based on a parallel corpus for German Sign Language (DGS), the Czech TTSL system, and the work from Othman and Jemni [48], which implemented an SMT from English written text to American Sign Language (ASL) gloss.

While EBMT and SMT approaches can be considered independent of a deep linguistic analysis of the target language, their translation performance is heavily dependent on the diversity of sentences and the size of the corpus. As pointed out by Porta et al. [50], for SMT systems in particular, when data are scarce or the domain is sparse, the induction of correct translation hypotheses is difficult or impossible. Among the strategies studied to improve the translation result when the corpus is not sufficiently large, Massó and Badia [41] suggest the use of morpho-syntactic information from the sentences. Recent initiatives to build large parallel corpora are also reported in the literature [23, 62].

Hybrid approaches arise as an alternative adopted by systems that aim to take advantage of the data-driven approach without losing too much translation quality. This is the case for the present work, which adopts Falibras, a TTSL system from Brazilian Portuguese into Libras [10]. Since its first release in 2002 [14], the system has undergone many cycles of development and improvement. Currently, Falibras is characterized as a hybrid approach that combines rule-based syntactic transfer and data-driven approaches [15] using knowledge derived from a translation memory [9] and a database of previously translated terms and segments used as a reference for similar translation cases [61]. Additionally, to disambiguate terms that can be translated in different ways, Falibras also implements a statistical model that takes into account the context and frequency of the occurrence of the terms from previous translations. Details of the implementation are provided in Sect. 3.

Another example of a hybrid system is described by San Segundo et al. [54]. The system differs from Falibras because it includes a speech recognition module capable of translating Spanish speech to Spanish sign language (LSE). Our approach is focused on text-to-sign language translation.

Finally, it is also important to observe that MT systems for sign language are typically designed to resolve accessibility problems for deaf people regarding real services or relevant information, such as post office services [17], driver’s license renewal services [54], access to disaster messages [40], weather forecast information [71], and public transport passenger information [56]. Such systems focus on the modeling of the signs and sentences that frequently occur in the contexts in which they are specialized.

Similarly, the present work focuses on the deployment of an application for the interactive translation into Libras of elementary school textbooks. In the present version, the corpus is built based on the contents of a science textbook for young children (see Sect. 3.1).

3 Our approach

Figure 2 presents an overview of the Brazilian Portuguese to Libras TTSL system under development. In the figure, the rectangles with rounded corners represent processes, and the cylinders represent main databases. As depicted in the figure, the system is composed of two main modules, namely the Translation Module and the Animation Module.

Fig. 2 System architecture

The system receives as input text in Brazilian Portuguese to be presented by the signing avatar. Based on a set of rules and the knowledge obtained from a translation memory, the Translation Module analyzes the input text and converts it into an intermediary representation—the Intermediary Language. The Intermediary Language is the input for the Animation Module which maps symbols of the Intermediary Language to signs in Libras.

The signs in the Animation Module are stored in a database using a parametric description (see Fig. 2). The parametric description includes information like the configuration and orientation of the hands, place of articulation, and the type of the movement that defines a template sign. During synthesis, each template sign can be modified to handle word inflections informed by the Intermediary Language. A word inflection can result in changes in sign parameters associated, for example, with intensity, speed, repetition, and facial expression. The signing avatar is a three-dimensional (3D) geometric model that is driven by sign descriptions and an algorithm to concatenate the sequence of signs.

The following sections describe in detail our approach for the implementation of the proposed architecture. Section 3.1 describes the approach for building a parallel Brazilian Portuguese/Libras corpus which is used to feed the databases depicted in Fig. 2. Section 3.2 details the implementation of the Translation Module. The Intermediary Language is described in Sect. 3.3 and the Animation Module is described in Sect. 3.4.

3.1 Parallel Brazilian Portuguese/Libras corpus

Our approach for implementing a Brazilian Portuguese to Libras translation system is anchored in the analyses of a parallel Brazilian Portuguese/Libras corpus. The analyses of the corpus allow us to identify general rules of translation and exceptions to these rules, accumulate examples of translation, and model the production of signs. The translation rules are stored in the Rules Database and the exceptions in the Examples of Exceptions Translation Memory (see Fig. 2). Likewise, examples of translation are collected in the Examples of General Translation. The Parametric Sign Description Database contains descriptions of how to produce signs in the format expected by the Animation Module.

The guidelines for building a parallel Brazilian Portuguese/Libras corpus are aligned with the primary goal of deploying an application for the interactive translation into Libras of elementary school textbooks, as illustrated in Fig. 1.

3.1.1 Methodology for compiling a Brazilian Portuguese corpus

Considering the target application (see Fig. 1), an important step in our methodology was to understand the realities faced by the deaf community, in particular deaf children in the first years of their school life, regarding the support material available for their special needs.

A brief survey in the Brazilian state of Sao Paulo involved a sample of ten schools, both public and private, with deaf individuals among their students and interpreters capable of mediating the interaction between teachers and students. The survey showed that, among the ten schools analyzed, eight were unaware of any educational material for teaching deaf children and only two knew of a single source material. Of these two schools, only one made use of a source material available on the market today [20], which consists of a book designed for hearing students accompanied by a CD with the translation of its content into Libras. This scenario highlights the accessibility gap that exists between deaf students and educational textbooks. Based on the survey, we decided to build a parallel corpus from the content of a textbook. The choice of textbook was based on two guidelines: the needs of the target group (deaf children) and the educational quality of the book.

Our understanding about the educational needs of deaf children was constructed from the experience of professionals who assist deaf children older than 7 years of age at the Center for Studies and Research in Rehabilitation at the University of Campinas. An important observation reported by these professionals regards the huge difficulties faced by deaf teenagers in understanding the course subjects they are supposed to learn during their later school years (young students aged 16–18). They observed that deaf teenagers hardly understand written Portuguese in the textbooks—not necessarily because they are unable to read or write, but because they have not acquired the background knowledge necessary for understanding the subjects presented. Therefore, it is important to support the reading and writing activities of texts starting in the early school years in order to allow deaf students to gradually obtain the basic knowledge required for comprehending more complex texts.

Based on this diagnosis, we decided to build our corpus by compiling the text from a textbook used during the early school years. Given its practical merits for life and for the study of disciplines such as biology, physics, and geography, we selected a science textbook designed for third-grade students (aged 8–9).

In order to select the book to be analyzed, we first defined a set of candidate third-grade science textbooks from a list of books recommended by the Brazilian Ministry of Education and Culture (MEC). Second, we asked ten schools with deaf students which books they currently use among those recommended by MEC. These books were given greater weight in the analysis. The set of candidate books was then further analyzed according to three criteria: the visual organization of the book, the progression of complexity in the use of the written language, and the translation difficulties associated with the written content.

We adopted the visual organization of the book as a selection criterion because visual cues are extremely important for deaf people due to the fact that they occur through a channel accessible to the deaf [59]. In this sense, the role of visual attention for deaf persons is equal to the importance of auditory attention for hearing persons. According to Siple, when the input information occurs through a visual channel, deaf persons have greater benefits in cognitive and language development. The importance of visual elements in pedagogy for the deaf is stressed by authors such as Reily [52], who state that visual literacy for the deaf is an especially important skill because the use of images favors the learning of concepts and the association of ideas. Researchers from the New London Group, who are concerned with multiliteracies, indicated that broadening the horizons of the school to provide access to digital technologies (PCs, phones, tablets, etc.) is necessary in order to allow students access not only to written texts, but also to multimodal texts [46].

Considering the complexity of the written language of the book, we initially thought that one of our criteria should be whether the vocabulary was introduced slowly, with the explanations important for deaf children. Words like “earth,” “planet,” “world,” and “universe,” for instance, had to be clearly explained and differentiated. Hearing children hear these words constantly in adults’ conversations and in the media, but we knew that for deaf teenagers the differences between such words were not very clear, since they are all represented by the same sign in Libras. We also thought that syntax was a point to be observed: we wanted to avoid books with too many long sentences, especially at the beginning of the book. The analyses of the textbooks, however, showed that these criteria could not be fulfilled. It was not possible to find a book that introduced more complex structures of Brazilian Portuguese only after simpler ones. At that point, we reconsidered the criteria, since such textbooks had been written for children who have Brazilian Portuguese as their mother tongue; children who have Libras as their mother tongue were not the target group of these schoolbooks. In this context, it is also important to stress that deaf children undertake a more difficult task than others because they are learning Brazilian Portuguese at the same time as they are learning the contents of the written textbooks. They have to read about things they do not know in a language that is not their mother tongue and somehow translate that knowledge into Libras.

3.1.2 Textbook translation

Following the textbook selection, the next step in our methodology was to perform its translation to Libras. The translation process was performed through the close interaction of hearing native Brazilian Portuguese speakers proficient in Libras and deaf individuals having Libras as their first language and reading skills in Brazilian Portuguese. The textbook content was divided into more than 2000 sentences and, seeking to guarantee an acceptable level of standardization, the proposed translations were checked against well-known Libras dictionaries [12, 37]. A major challenge faced during the translation process was that even the most widely known Libras dictionaries in Brazil are still limited to everyday vocabulary, missing signs for many specific scientific and technical terms. In order to avoid fingerspelling, or dactylology, our strategy was to establish a network of deaf and hearing people, teachers, and technical professionals with mastery of Libras to discuss signs found in field-specific Libras glossaries or sign compendia and, when necessary, to create and validate new scientific or technical signs. At present, the network is composed of 22 individuals, and the textbook translation resulted in the creation of more than 160 signs to represent scientific terms.

In addition, the following aspects were taken into consideration during the translation process:

  • Respect for the most common syntactic structure in Libras, that is, the “topic-comment” structure instead of the structure commonly used in Brazilian Portuguese, which is characterized by the “subject-verb-object” order. It is important to emphasize that the “subject-verb-object” structure is also found in Libras [22]; our choice, however, follows more recent linguistic descriptions of Libras indicating that the “topic-comment” structure is currently more frequent [49].

  • The use of Libras classifiers to express actions that usually require several words in Brazilian Portuguese. Libras classifiers condense long Brazilian Portuguese sentences into short ones that incorporate the form and/or movement of the object or subject. In this way, this structure moves away from a more literal translation.

  • The presence of polysemy and homonymy in Libras (for example, the Brazilian Portuguese words “planet,” “Earth,” “world,” and “universe” are all represented by the same sign). With sentences like “After Earth, how many planets will pass until you reach your destination?” we observed difficulties in interpretation because “Earth” and “planet” have the same sign. In this particular case, we chose to translate “Planet Earth” with the sign for “planet” followed by the manual spelling of the word “Earth.”

The final result of the translation process was a set of videos of interpreters and deaf individuals signing the translation of the textbook sentences to Libras. In addition, each video was transcribed as a sequence of glosses, which corresponds to the sequence of signs recorded in each video.

3.1.3 Sign acquisition process

Considering the more than 9500 catalogued Libras signs in the largest compendium ever written on the subject [13], and taking into consideration the fact that Libras is a “living” language (constantly changing and evolving), the construction of a sign database is a major challenge to be faced.

The strategy for tackling this problem involves the use of motion capture (MoCap), which enables the capture of the movement of the body, limbs, head, and face in the tridimensional space. The use of MoCap technology to construct sign language corpora has been previously applied, for example, in the works of Gibet et al. [24], for the construction of a French Sign Language (FSL) corpus, and Lu and Huenerfauth [38], for the construction of an American Sign Language (ASL) corpus. MoCap technology enables the recording of natural and accurate description of signs and sentences performed by a signer. Such a realistic description of signs and signed sentences is applied to improve the realism of signing avatar animation. In addition, a relatively large amount of signs can be recorded in each MoCap session, facilitating corpus construction and further expansion.

The motion capture sessions were guided by the video recordings resulting from the textbook translation process (see 3.1.2). The MoCap equipment used is a passive optical system consisting of 8 infrared cameras and reflective markers fixed to the Libras interpreter’s body. In our configuration, 41 markers were arranged on the upper body of the interpreter. During a recording session, the system captures, simultaneously, the markers’ trajectories in space and the video of the signs presented by the interpreter. Figure 3 shows snapshots of a MoCap session. The MoCap setup also included two video cameras, one positioned in front of the deaf native signer and the other capturing a side view. The material recorded by the cameras was used during the annotation process, as detailed in Sect. 3.1.4.

Fig. 3 MoCap session

Even though it comprises eight cameras, the MoCap equipment may lose track of a marker when it does not correctly recognize the same marker in at least two frames captured by different cameras. Thus, after each MoCap session, the captured information has to be manually inspected to eliminate noise and correct tracking flaws. In addition, due to unavoidable occlusions of the finger joints that occur during Libras signing, the MoCap equipment is not capable of satisfactorily tracking finger movements. Consequently, the joints of the fingers have to be corrected in the post-production process, which uses the video recorded during the MoCap session to retrieve the configurations of the hands. Facial expressions are also an important non-manual semantic component of sign language communication. In order to capture such expressions with accuracy, the head-mounted MoCap system Vicon Cara is being used. The system is equipped with four 1280 \(\times\) 720 pixel, 60 fps cameras (Fig. 4). The facial expressions captured are carefully combined with data from the upper body gestures during the MoCap post-processing phase.

Fig. 4 Head-mounted MoCap system to capture facial expressions which have semantic meaning in Libras
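To give a concrete sense of this post-production step, the sketch below fills short tracking dropouts in a single marker trajectory by linear interpolation between the nearest valid frames. It is a simplified illustration, not the project's actual tooling: the array layout, the NaN convention for lost frames, and the gap-length threshold are assumptions made for the example, and longer gaps, as well as finger and facial data, still require the manual repair described above.

```python
import numpy as np

def fill_marker_gaps(trajectory: np.ndarray, max_gap: int = 10) -> np.ndarray:
    """Fill short tracking dropouts (NaN frames) in a (frames, 3) marker
    trajectory by linear interpolation; longer gaps are left for manual repair.
    Illustrative sketch only; the real pipeline also uses the reference video.
    """
    filled = trajectory.copy()
    frames = np.arange(len(filled))
    for axis in range(filled.shape[1]):
        col = filled[:, axis]
        missing = np.isnan(col)
        if not missing.any():
            continue
        # Locate contiguous runs of missing frames.
        padded = np.concatenate(([False], missing, [False]))
        edges = np.diff(padded.astype(int))
        starts = np.where(edges == 1)[0]
        ends = np.where(edges == -1)[0] - 1
        for start, end in zip(starts, ends):
            gap = end - start + 1
            if gap <= max_gap and start > 0 and end < len(col) - 1:
                # Interpolate between the last valid frame before the gap
                # and the first valid frame after it.
                col[start:end + 1] = np.interp(
                    frames[start:end + 1],
                    [start - 1, end + 1],
                    [col[start - 1], col[end + 1]],
                )
    return filled
```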

In the last step of the MoCap process, the MoCap data are semi-automatically analyzed in order to generate the parametric sign descriptions that are stored in the associated database (see 3.4.1).

The present version of the corpus comprises more than 2000 sentences, corresponding to approximately 8 h of raw material and the complete content of a school science textbook.

3.1.4 Annotation process

The material recorded by the frontal- and side-view cameras, synchronized with the spatial marker coordinates provided by the MoCap system, was used to annotate and segment the temporal frontiers of each sign for each translated sequence. The annotation schema includes the following tiers:

  • English translation of the sentence;

  • Brazilian Portuguese translation of the sentence;

  • the sequence of glosses (identification of signs in Libras);

  • hand configuration identification for both hands;

  • facial expression identification;

  • the presence of narrative features;

  • comments.

The annotation tool used was ELAN (EUDICO Linguistic Annotator) [72]. For each translated sentence, the following synchronized parallel data constitute the Brazilian Portuguese/Libras corpus: a frontal-view camera recording, a side-view camera recording, the signing gestures as MoCap C3D and BVH files, and the corresponding ELAN Annotation Format (EAF) files. A more detailed description of the corpus construction is provided in [19].
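As an illustration of how the annotation tiers can be consumed programmatically during corpus processing, the sketch below extracts time-aligned glosses from an EAF file using only the Python standard library. It assumes the standard EAF XML layout (time slots plus alignable annotations); the tier name "GLOSS" and the file name in the usage comment are hypothetical.

```python
import xml.etree.ElementTree as ET

def read_gloss_tier(eaf_path: str, tier_id: str = "GLOSS"):
    """Extract (start_ms, end_ms, gloss) triples from one tier of an ELAN file."""
    root = ET.parse(eaf_path).getroot()
    # Map time-slot identifiers to their time values in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE", "0"))
        for ts in root.findall("./TIME_ORDER/TIME_SLOT")
    }
    annotations = []
    for tier in root.findall("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.findall("./ANNOTATION/ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            gloss = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            annotations.append((start, end, gloss))
    return annotations

# Example usage (hypothetical file name):
# for start, end, gloss in read_gloss_tier("sentence_0001.eaf"):
#     print(f"{start:>7} {end:>7}  {gloss}")
```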

3.2 Translation Module

The Translation Module processes the information based on the grammatical structures of the source and target languages. Figure 5 presents the activities of the translation process, showing when the databases presented in Fig. 2 are used. The module is composed of a text preprocessor, a grammatical analyzer, three complementary translators, and a sentence post-processor.

Fig. 5 Translation process

As depicted in Fig. 5, the module receives a Brazilian Portuguese text as input to be converted into the Intermediary Language. At the beginning of the conversion process, the input text is preprocessed in order to replace elements of the sentence that could cause semantic problems in the translation. Examples of expressions to be replaced are figures of speech, domain-specific jargon, and misspelled words. Next, the module performs a grammatical analysis of the preprocessed text in order to obtain morphological and syntactic annotations for each word. The module then uses this grammatical information to assess whether the sentence matches an exception previously stored in the Examples of Exceptions Database. Exceptions are restrictive translation scenarios that cannot be accommodated in the Examples of Translation Database; translations involving the construction of interpretation scenarios are one example. When the input is not an exception, the text is checked to see whether it corresponds to an example stored in the Examples of Translation Database. Such examples are indexed according to the morphological and syntactic structures of the sentences in order to facilitate the identification of similar sentences to be translated. The examples in the Examples of Translation Database are used to speed up the translation process and improve its quality by providing a translation rule specific to the grammatical structure of the sentence. If there is no match between the text and a translation example, the input is processed according to the standard translation rules stored in the Rules Database.
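A minimal sketch of this decision flow is given below. The class and function names are illustrative, not the actual Falibras implementation; the point is only the order in which the three strategies are consulted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AnalyzedSentence:
    text: str                                       # preprocessed input text
    tags: List[str] = field(default_factory=list)   # morphological/syntactic annotations
    structure_key: str = ""                         # index derived from the grammatical structure

def translate_sentence(sentence: AnalyzedSentence,
                       exceptions: Dict[str, str],
                       examples: Dict[str, Callable[[AnalyzedSentence], str]],
                       rules: Callable[[AnalyzedSentence], str]) -> str:
    """Select among the three complementary translation strategies:
    stored exceptions first, then translation examples indexed by the
    grammatical structure of the sentence, and the standard transfer
    rules as the fallback. Hypothetical sketch of the flow in Fig. 5.
    """
    if sentence.text in exceptions:            # Examples of Exceptions Database
        return exceptions[sentence.text]
    example_rule = examples.get(sentence.structure_key)
    if example_rule is not None:               # Examples of Translation Database
        return example_rule(sentence)
    return rules(sentence)                     # Rules Database
```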

Finally, after one of the translation strategies has been executed, the translated sentences go through a post-processing step, in which they are mapped to annotated glosses following the structure of the Intermediary Language. For this, however, some semantic ambiguities must also be resolved. For example, in the sentence “Esse é um campo complexo da ciência” (This is a complex field of science), the word “campo” (field) should be translated to the gloss “OCCUPATION-AREA” instead of “GROUND.” Such semantic analysis is performed using a Bayesian network [18] to estimate the probability of each meaning based on the context provided by the words of the sentence, as well as the words of adjacent sentences. The outputs of the post-processor are sentences in the Intermediary Language (see Figs. 2 and 5), which is detailed in Sect. 3.3.
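The sketch below illustrates the underlying idea with a naive Bayes-style score rather than a full Bayesian network: each candidate gloss is scored by its prior frequency in previous translations combined with how often it co-occurred with the surrounding context words. The glosses, probabilities, and context window are invented for the example.

```python
import math

def disambiguate(candidate_glosses, context_words, cooccurrence, priors):
    """Pick the most probable gloss for an ambiguous word.

    Illustrative only: the actual system uses a Bayesian network over the
    context of the current and adjacent sentences.
    """
    best, best_score = None, float("-inf")
    for gloss in candidate_glosses:
        # Log prior of the gloss plus log co-occurrence with each context word.
        score = math.log(priors.get(gloss, 1e-6))
        for word in context_words:
            score += math.log(cooccurrence.get((gloss, word), 1e-6))
        if score > best_score:
            best, best_score = gloss, score
    return best

# Illustrative data for the "campo" (field) example above:
priors = {"OCCUPATION-AREA": 0.4, "GROUND": 0.6}
cooccurrence = {("OCCUPATION-AREA", "ciência"): 0.05, ("GROUND", "ciência"): 0.001}
print(disambiguate(["OCCUPATION-AREA", "GROUND"], ["ciência"], cooccurrence, priors))
# -> OCCUPATION-AREA
```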

3.3 Intermediary Language

The Intermediary Language uses glosses to represent signs accompanied by information associated with inflection. The Intermediary Language was devised to provide a straightforward mapping to signs and defines animation commands that are interpreted by the Animation Module in order to control the movements of the avatar.

The structure of the Intermediary Language is presented in Table 1, following the Backus–Naur Form [43].

A translation (<translation>) is defined by a type (<type>) and a sentence (<sentence>). The type defines whether the sentence is an affirmation, a question, or an exclamation. The type information is needed when signing since it may interfere with facial expressions or even with the speed of the sign.

A sentence is composed of a list of words (<word>), separated by commas. A word can be defined by a non-verb inflection (<nonverb_inflection>) followed by a non-verb (<non_verb>), or by a verb inflection (<verb_inflection>) followed by a verb (<verb>). When the word is a verb (<verb>), the sign can also be changed according to modifiers found in the sentence (<verb_inflection>). Examples of such modifiers are fast, slow, fixed, and intense, which can interfere with the speed and repetition of the sign, as well as with the facial expression. An analogous organization applies to non-verb words. Table 1 is a partial representation of the structure of the Intermediary Language and presents enough information to give a general idea of the language.

Table 1 Structure of the translation in the intermediary language

To illustrate the use of the Intermediary Language, the sentence “Vamos desenhar as estações do ano?” (Shall we draw the seasons of the year?) is translated into three signs in Libras and represented in the Intermediary Language by the following string of words (see Fig. 6): question <type>, ESTACAO-DO-ANO <non_verb>, present <time> DESENHAR <verb>, present <time> ACEITAR <verb>. The Animation Module handles this text and animates the avatar to produce three signs in Libras: “ESTACAO-DO-ANO” (season of the year), followed by “DESENHAR” (to draw), and then “ACEITAR” (to accept). Moreover, since the sentence is a question, the avatar indicates this by emphasizing doubt in its facial expression.

Fig. 6 Example “Vamos desenhar as estações do ano?”

Another example can be seen in the sentence “No inverno chove muito” (It rains a lot in winter) which is represented in the Intermediary Language by (see Fig. 7): affirmation <type>, INVERNO <non_verb>, intense <inf_aspect> present <time> CHOVER <verb>. In this case, the Animation Module produces the animation of two signs: “INVERNO” (winter), “CHOVER” (to rain). However, the sign for the verb “CHOVER” is modified to indicate the intensity of the “CHOVER” (intense <inf_aspect>).

In the Intermediary Language, each individual sign is represented by a unique name, or gloss.

Fig. 7 Example “No inverno chove muito”
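To make the textual rendering of the Intermediary Language used in these examples more concrete, the sketch below parses such a string into (value, tag) pairs. It assumes the comma-separated "token <tag>" layout shown above; the actual interface between the Translation and Animation Modules may use a richer structured representation.

```python
import re

def parse_intermediary(sentence: str):
    """Parse the textual rendering of the Intermediary Language used in the
    examples of this section into (value, tag) pairs."""
    tokens = []
    for item in sentence.split(","):
        # Each item is a sequence of "value <tag>" pairs,
        # e.g. "present <time> DESENHAR <verb>".
        for value, tag in re.findall(r"(\S+)\s*<([^>]+)>", item):
            tokens.append((value, tag))
    return tokens

example = ("question <type>, ESTACAO-DO-ANO <non_verb>, "
           "present <time> DESENHAR <verb>, present <time> ACEITAR <verb>")
print(parse_intermediary(example))
# [('question', 'type'), ('ESTACAO-DO-ANO', 'non_verb'),
#  ('present', 'time'), ('DESENHAR', 'verb'),
#  ('present', 'time'), ('ACEITAR', 'verb')]
```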

3.4 Animation Module

The Animation Module is responsible for animating the avatar and presenting the resulting animation on a computer screen. Control of the movements performed by the avatar is based on the information provided by the Intermediary Language, which specifies a sequence of signs accompanied by inflection information, when appropriate. The inflection information is used to modulate parameters of the production of the sign, such as velocity, spatial amplitude of the movement, and number of repetitions.
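The sketch below illustrates how inflection information can modulate a template sign before animation. The parameter names and the amount of modulation are assumptions chosen for the example; they stand in for the verb and non-verb inflections carried by the Intermediary Language (e.g., the "intense" aspect in the CHOVER example of Sect. 3.3).

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SignProduction:
    gloss: str
    speed: float = 1.0          # playback speed multiplier
    amplitude: float = 1.0      # spatial amplitude of the movement
    repetitions: int = 1
    facial_expression: str = "neutral"

def apply_inflection(sign: SignProduction, inflection: str) -> SignProduction:
    """Modulate a template sign according to an inflection label.

    The labels and modulation amounts are illustrative assumptions, not the
    system's actual parameter values.
    """
    if inflection == "intense":
        return replace(sign, amplitude=1.3, repetitions=sign.repetitions + 1,
                       facial_expression="intense")
    if inflection == "fast":
        return replace(sign, speed=1.5)
    if inflection == "slow":
        return replace(sign, speed=0.7)
    return sign

print(apply_inflection(SignProduction("CHOVER"), "intense"))
```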

The visible part of the signing avatar comprises a three-dimensional polygonal mesh (Fig. 8, center) with associated skin, clothing, and hair textures to create a realistic representation of a human being (Fig. 8, left). Figure 9 shows the avatar as seen by the user.

Fig. 8 The three-dimensional model of the avatar (from left to right): as presented to the user; polygon mesh; and control skeleton

Fig. 9 Signing avatar

The actual movement of the avatar is controlled by its skeleton (Fig. 8, right). The skeleton is a system of virtual bones and joints arranged hierarchically to simulate the existing joints of the human body. The movement of each bone influences the deformation of predefined regions of the polygonal mesh to create the illusion of natural body movement. As with the body animation, the facial animation of the virtual three-dimensional model is also controlled by a set of joints (Fig. 8, right). The information required for the handling of the bone-and-joint system is retrieved from XML files containing the parametric description of the signs stored in the database of the Animation Module (see Fig. 2). Details of the format and structure of the XML sign description are discussed in 3.4.1.

3.4.1 Sign description

There is a wide range of objective and subjective elements that together generate meaning in a gestural sequence in sign language. Stokoe Jr. [63], in his pioneering work, enumerated the components of the signs, analyzing the morphological particles of the gestures correlated to the phonemes of the oral languages. This approach was further developed by Liddell and Johnson [35], incorporating components such as position, orientation and hand configuration, movement, trajectories, and locations and points of contact, as well as non-manual elements. Rhythm, cadence, and pauses also contribute to the effectiveness of message transmission in sign language.

Despite previous efforts, there is no consensus on the structure of sign languages [42]. Linguists and researchers have not yet agreed upon which information should be considered relevant and recorded in transcriptions. Unlike spoken language, which for hundreds of years has been represented by a quasi-phonological system—the alphabet—sign language still lacks a widely accepted writing system.

The proposed approach is based on the sign description format described in Amaral et al. [4] and Amaral and De Martino [3]. The format is based on XML language and describes the relevant characteristics of signs needed to synthesize the animation of the avatar. Essentially, the format considers that a sign or a sequence of signs can be described by a sequence of keyframes, or poses, and the transitions between them. The main features of the sign description format can be summarized as follows:

  • The format structures the characteristics of the signs hierarchically;

  • the format uses textual notation widely used for computational processing (XML-based), thus facilitating implementation;

  • the format can represent sequentiality and concurrency in a sign;

  • the format is capable of expressing the concept of symmetry, which leads to a more succinct sign description;

  • the format describes the configuration of the hands, location points, signing space, and other relevant features from a geometric point of view;

  • the format also allows for the representation of non-manual, facial, and body expressions.

According to Johnson and Liddell [31–34], a sign can be described as a composition of two stages: at first, the hands, face, and body are held stationary in space; they then move until they reach a new stationary moment. The format represents these two stages by the notation elements “pose” and “movement,” which respectively describe the configuration of the joints of the avatar skeleton in the start position and the movement away from this initial state.
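The fragment below suggests what such a description can look like and how it can be processed. The element and attribute names are assumptions chosen to mirror the pose/movement decomposition; they are not the exact schema of Amaral et al. [3, 4].

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: element and attribute names are hypothetical.
SIGN_XML = """<sign gloss="DESENHAR">
  <pose id="start">
    <hand side="right" configuration="pinch" orientation="palm-down"
          location="chest-center"/>
    <hand side="left" configuration="open" orientation="palm-up"
          location="chest-center" symmetry="mirror"/>
    <face expression="neutral"/>
  </pose>
  <movement from="start" to="end" trajectory="zigzag" duration="0.6"/>
  <pose id="end">
    <hand side="right" configuration="pinch" orientation="palm-down"
          location="chest-right"/>
  </pose>
</sign>
"""

sign = ET.fromstring(SIGN_XML)
for pose in sign.findall("pose"):
    hands = [h.get("configuration") for h in pose.findall("hand")]
    print(pose.get("id"), hands)
```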

It is important to note that the textual description of the subtle movements performed during signing is not a trivial task. However, tests involving deaf people using a prototype implementation of the avatar have shown that our approach produces an intelligible sequence of signs. More details about the description of signs and the intelligibility test can be found in Amaral [2], Amaral et al. [4], and Amaral and De Martino [3].

3.4.2 Concatenating signs

As stated previously, the articulation of avatar joints is made by interpolating translations and rotations of bones in the control skeleton between two or more keyframes. If performed without further refinement, the resulting animation may appear overly artificial, which may hinder the understanding of the content. To prevent this type of problem, the joint interpolation will use mathematical models retrieved from MoCap data using techniques known as curve fitting and curve approximation.

The keyframes extracted from the description database (see Fig. 2) constitute atomic signs; to create a sentence, these have to be concatenated. A naive approach would be to interpolate a sign’s final keyframe values to the following sign’s first keyframe values. The computational cost of this approach is minimal, but the result is not visually appealing since it does not consider any physiological constraints and thus renders the movement mechanical and unnatural. We approach this problem by analyzing motion capture data from movements between signs in order to create a mathematical model that can emulate hand, arm, body, and facial movements more naturally.
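The sketch below contrasts the naive linear interpolation with a slightly smoother eased transition, only to hint at what a fitted transition model changes; the actual model is derived from the motion capture data, as discussed in this section.

```python
import numpy as np

def naive_transition(last_keyframe: np.ndarray,
                     first_keyframe: np.ndarray,
                     n_frames: int = 12) -> np.ndarray:
    """Linearly interpolate joint values between the final keyframe of one
    sign and the first keyframe of the next (the naive baseline above)."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - t) * last_keyframe + t * first_keyframe

def eased_transition(last_keyframe: np.ndarray,
                     first_keyframe: np.ndarray,
                     n_frames: int = 12) -> np.ndarray:
    """Same endpoints, but with a smoothstep profile so the joints accelerate
    and decelerate instead of moving at constant velocity (illustrative only;
    the project fits transition curves to MoCap data instead)."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    s = t * t * (3.0 - 2.0 * t)      # smoothstep easing
    return (1.0 - s) * last_keyframe + s * first_keyframe

# Example: transition for three joint angles (hypothetical values, in degrees).
a = np.array([10.0, 45.0, -20.0])
b = np.array([35.0, 10.0, 5.0])
print(eased_transition(a, b, n_frames=5))
```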

The study of human arm movement is a vast field in neurophysiology. Much research has been conducted in order to understand how this movement is planned, as in Gottlieb et al. [26] and Soechting et al. [60]. Such studies are relevant not only in movement analysis, for instance in increasing a baseball's velocity [27], but also in computer graphics, in order to better predict the movement of a character's arm [73]. While there is not yet a consensus on the existence of a universal model able to predict arm movement, some researchers found a minimum torque-change model that has predicted planar arm movement with some accuracy [68]. Others have shown experimental results indicating a correlation of the torques in arm joints with the achievement of a final grasp orientation [39]. Still other studies have shown that, for wider or different ranges of movement than those examined in the works that led to the hypothesis of a law of movement, such a law does not apply. Specifically, in Soechting et al. [60], the experimental results did not conform to Donders' Law, which predicts that a given final position of the hand in space is achieved by a unique posture of the elbow and shoulder angles.

What is interesting about these studies is that each tends to deal with a very specific arm trajectory (a baseball throw [27], a constrained movement of the arm starting at 18 predefined locations and ending at 5 predefined targets [60], or reaching toward fixed objects [39]), which may explain the different results. These experiments were also conducted using movements that start and end at rest, that is, with zero initial and final velocity (in the avatar, movement usually does not start or end at zero velocity). Considering the specific needs of our application, we decided to construct a transition model derived from the analyses of the motion capture data from our parallel corpus.

Currently, we are studying the MoCap data, adjusting it to different mathematical models and evaluating the results. In the future, we also intend to explore machine learning algorithms to find and tune the models of transition based on initial and final arm postures.

4 Intelligibility evaluation

The final quality of the system presented in Fig. 2, including the accuracy of the translation and the animation aspects that influence the intelligibility of the signing avatar, will depend heavily on the evolving design of the Translation and Animation Modules, which is being refined through intermediate evaluation steps.

Considering this approach, we performed an isolated sign intelligibility test (ISIT) as an early assessment to check the foundations of our animation strategy. The ISIT seeks to measure how intelligible the isolated signs produced by our animation system are. The ISIT was conducted using a set of 18 signs from Brazilian Sign Language (see Table 2). Each sign was presented both as a video clip of a real interpreter and as an animation of the avatar. The experiment focused on evaluating the proper modeling of hand configurations and the graphical presentation of the avatar. For this reason, the sign animation was not driven by motion capture data; instead, a key-pose animation approach was implemented. Both types of stimuli (video and animation) were presented as single signs, without being part of a sentence.

The ISIT is in accordance with the guidelines of biomedical research involving human beings of the National Health Council of Brazil. It was approved by the Human Research Ethics Committee of the School of Medical Sciences of the University of Campinas.

Eighteen signs allowed us to keep the total duration of the test to around 20 min. We selected a set of commonly used and well-known signs. The selected signs cover a broad variety of hand configurations and movements from Brazilian Sign Language.

Table 2 Signs used in the isolated sign intelligibility test or ISIT

Eighteen video clips of a real interpreter and 18 animations of the avatar, corresponding to each one of the 18 signs, were presented to 33 participants aged between 12 and 35, 16 of whom were deaf and 17 of whom were hearing; all were Libras signers.

The participants had no previous knowledge of the research or the goals of the test. The animations were always presented to a participant before the videos of the real interpreter, to avoid inference from the video about the signs presented by the avatar. The order of presentation of the signs was randomly shuffled for each new participant. After evaluating the signs generated by the avatar, each participant then evaluated the signs articulated by the real human interpreter.

A software tool was developed especially to present the animations and videos to the participants and then accept and store the answers. The tests were conducted on a notebook computer with a 1.6 GHz Core 2 Duo processor, 4 GB of RAM, and a GeForce 8600M graphics processing unit with 256 MB of memory. Each participant evaluated the material in an isolated room, without contact with other participants and without access to any additional aids such as dictionaries, books, or Internet sites. Before starting the evaluation, a brief introduction explaining the test was presented to the participants by an instructor, both orally (for the hearing individuals) and in Libras. The software tool was designed to present each animation/video only once. To start a new animation/video, the participant had to click a command button on screen using a mouse. After the presentation of an animation/video, the participant was asked to type the meaning of the sign in a text box, or to leave it empty if he/she did not recognize the sign. After advancing to the next animation/video, it was not possible to return and change a previous answer. After collecting the input of the participants, the answers were analyzed in order to identify and group answers with similar meanings, for instance, “andar” and “caminhar” (to walk and to go on foot). Typographical errors and misspelled words were also corrected during this analysis.

Table 3 presents the number of correct answers for each sign and the related intelligibility scores. In the table, the columns under the heading “# Correct Answers” present the number of correct answers for the videos of the real interpreter (“Video” column) and for the animations (“Animation” column). The scores presented in the columns listed under “Intelligibility Scores” are calculated by dividing the number of correct answers by the number of participants. The last row of the table presents the total number of correct answers for all signs. The average intelligibility scores are calculated by dividing the total number of correct answers by the total number of all answers (33 participants times 18 signs = 594).

Table 3 ISIT results

The results show that, for the majority of the tested signs (12 out of 18), the intelligibility scores of the avatar are equal to or greater than those for the video of the real interpreter. The average intelligibility score for the videos is 0.939 and for the animations is 0.92. A nonparametric Mann–Whitney U-test was applied to test the null hypothesis \(\mu _v = \mu _a\), where \(\mu _v\) is the intelligibility score for the video of the real interpreter and \(\mu _a\) is the intelligibility score for the avatar. The Mann–Whitney U-test resulted in Z \(= 0.52204\) and a p value \(= 0.60164\), meaning that the difference between \(\mu _v\) and \(\mu _a\) is not statistically significant. Considering the alternative hypothesis \(\mu _v > \mu _a\), the Mann–Whitney U-test resulted in Z \(= 0.52204\) and a p value \(= 0.69918\), meaning that it is not statistically possible to state that the intelligibility score of the video is greater than that of the avatar.
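For readers who wish to reproduce this kind of analysis, the sketch below computes intelligibility scores from per-sign correct-answer counts and applies the Mann–Whitney U-test with SciPy. The counts in the example are hypothetical placeholders, not the values of Table 3, so the printed figures will differ from those reported above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

N_PARTICIPANTS = 33

# Hypothetical per-sign correct-answer counts (the real values are in Table 3).
video_correct = np.array([33, 31, 30, 33, 32, 29, 33, 30, 31,
                          33, 32, 31, 30, 33, 29, 31, 32, 30])
avatar_correct = np.array([33, 30, 31, 32, 31, 30, 33, 29, 31,
                           32, 31, 30, 29, 33, 30, 31, 31, 30])

# Per-sign intelligibility scores and overall averages.
video_scores = video_correct / N_PARTICIPANTS
avatar_scores = avatar_correct / N_PARTICIPANTS
print("average video score: ", video_correct.sum() / (N_PARTICIPANTS * 18))
print("average avatar score:", avatar_correct.sum() / (N_PARTICIPANTS * 18))

# Two-sided test of H0: the two score distributions are equal.
stat, p_two_sided = mannwhitneyu(video_scores, avatar_scores,
                                 alternative="two-sided")
# One-sided test of H1: video scores are greater than avatar scores.
stat, p_greater = mannwhitneyu(video_scores, avatar_scores,
                               alternative="greater")
print("p (two-sided):", p_two_sided, " p (video > avatar):", p_greater)
```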

Besides the 18 signs, the fingerspelling of two names by the avatar was also evaluated. We tested only two such items together with the 18 isolated signs in order to keep the duration of the test to around 20 min. The dactylology results yielded 32 correct answers for the name UNICAMP and 30 for ALICIA, out of a total of 33 answers. The results suggest that the avatar’s fingerspelling mechanism provides meaningful information.

5 Concluding remarks

Following the framework for action put forward by UNESCO’s “Salamanca World Conference on Special Needs Education,” Brazil instituted a model of inclusive education through a law which established guidelines and foundations for national education [6]. This law recommends that deaf students be offered opportunities to be included in the mainstream school system. Moreover, the Brazilian government recognized Brazilian Sign Language as a language of the communities of deaf people in Brazil [7] and established that the education of the deaf deserves a bilingual approach in which the teaching of deaf students through their first language, Libras, is ensured. Additionally, schools must teach deaf children the written form of Brazilian Portuguese as a second language [8].

However, despite the favorable legislation, in practice the inclusion model has left open issues regarding the implementation of the bilingual approach in schools, and many schools lack teachers with bilingual training to work with deaf students. Other aggravating factors include the lack of professional Libras translators and interpreters to accompany deaf students in the classroom, as well as the lack of appropriate teaching materials for deaf education.

The present project brings us back to the traditional discussion of pedagogical questions among those who work with minority groups, the deaf among them, that is, bilingual teaching. We can hardly speak of bilingual education in a context in which the two languages in contact have different values in society and one has much more space than the other. This leads to the conclusion that deaf children fail in school not because they are deaf, but because their language is different from the official language of the school.

In this context, the present work focuses on the translation of educational material, aiming to improve the bilingual education experience for deaf children by facilitating the comprehension of written Portuguese and fostering the mastery of sign language. Among the contributions of the present work, we highlight first the implementation of a methodology to compile a source-language corpus that not only focuses on its suitability for automatic translation but also includes criteria relevant to the education of deaf children. Second, the project promoted the construction of a comprehensive parallel Brazilian Portuguese/Libras corpus, with more than 2000 sentences, which is also contributing to advancing studies on Libras, including the systematic description of its grammar. Third, we presented a text-to-sign-language translation architecture based on an Intermediary Language to drive a signing avatar. Finally, we have shown that the proposed animation methodology is capable of conveying intelligible signing.

The components of the proposed architecture and the integration between the Translation and Animation Modules are in the prototyping and evaluation phases. The Intermediary Language is being extended to include the indication of non-manual markers, such as facial expressions, which convey information important for adequate signing intelligibility. In addition, we plan to evaluate the quality of the translation component from two complementary perspectives: software engineering and language effectiveness. From the software engineering point of view, two quality attributes will be considered [47]: performance and scalability, i.e., the system's ability to respond efficiently to an increasing number of simultaneous accesses. From a linguistic point of view, the translation component will be evaluated using the Corpus Linguistics approach [5], which provides a means to evaluate translation efficiency by using translation examples (corpus) as benchmarks. Finally, we are also working on the definition of a comprehensive evaluation protocol in order to assess the system from the perspective of deaf students.