
1 Introduction

The current study examined how to use crowdsourcing to convert sign language to text. Generally, in Japan, a sign language interpreter reads and vocalizes a speaker's sign language, and caption typists generate captions from the vocalization. However, this method doubles labor costs and delays the provision of captioning. Therefore, we developed a system that interprets sign language into caption text via crowdsourcing, with non-experts performing the interpretation. Here, a non-expert is defined as a person who can read sign language but has no experience as a captioner or in any related role.

While many individuals classified as deaf/hard-of-hearing (DHH) who can read sign language are suitable workers for this task, not all of them possess adequate typing skills. To address this, our system divides live sign language video into shorter segments and distributes them to workers. After the workers interpret their assigned segments and type the corresponding text, the system generates captions by integrating these texts. Our system can thus establish an environment that not only enables sign language-to-text captioning but also gives DHH individuals an opportunity to assist those who cannot read sign language. In this report, we describe a prototype system for sign language-to-text interpretation and present the results of an evaluation experiment.

2 Related Work

2.1 Speech-to-Text Captioning via Crowdsourcing

Communication access real-time translation (CART) [1], which uses stenography and a special keyboard, and C-Print [2], which uses a common PC, have been developed to provide real-time speech-to-text captioning. In Japan, IPtalk [3] is often used to produce captions by having typists collaborate with each other, and the web-based captiOnline [4] has also been used recently. Furthermore, SCRIBE [5] is a real-time captioning system that uses crowdsourcing. The system divides the audio stream into segments of equal length, and a non-expert worker types the text for an assigned segment. The texts generated by the workers are merged and provided to users as captions. Because adjacent segments overlap, parts of the resulting texts contain the same words, and merging them yields a caption with fewer errors. In addition, TimeWarp was proposed to improve the performance of the captioning task [6]. The idea is to slow down playback while a worker is on a task and to speed it up while the worker is waiting. This mechanism works well when there is a large difference between the worker's typing speed and the speaker's talking speed. However, it is not obvious that this method is effective for sign language. One reason is that, unlike transcription of speech, transcription of sign language requires translation; depending on the words each worker chooses, the resulting texts may not be mergeable.
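The merging idea can be illustrated with a simple word-level overlap match (a minimal sketch of the general overlap-merge idea, not SCRIBE's actual algorithm; the function name is ours):

```typescript
// Merge two texts from overlapping segments by finding the longest word
// overlap at the boundary between them.
function mergeOverlapping(prev: string, next: string): string {
  const a = prev.split(/\s+/);
  const b = next.split(/\s+/);
  // Try the longest possible overlap first.
  for (let k = Math.min(a.length, b.length); k > 0; k--) {
    const tail = a.slice(a.length - k).join(" ");
    const head = b.slice(0, k).join(" ");
    if (tail === head) {
      return [...a, ...b.slice(k)].join(" ");
    }
  }
  // No shared words at the boundary: fall back to simple concatenation.
  return [...a, ...b].join(" ");
}

// Works when both workers transcribe the same words at the boundary.
console.log(mergeOverlapping("the quick brown fox", "brown fox jumps over"));
```

For translated sign language, however, two workers may render the same signs with different words, so the boundary overlap is often empty and this kind of merge fails.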

2.2 Sign Language-to-Text Captioning via Crowdsourcing

A study on direct sign language-to-text interpretation via crowdsourcing has been conducted [7]. Since sign language captioning requires translation, which is distinct from speech transcription, it is difficult to merge texts using overlap between segments. Instead, one worker is responsible for dividing the video into segments at semantic units. Each segment is assigned to a group of workers, and every worker in the group interprets the segment and types a text, so multiple texts are generated within the group. One text representing the segment is then selected by a vote of workers responsible for evaluating the quality of the texts. However, the typing workload and the quality of the captions depend on the skill of the worker who creates the segments. Moreover, because signs must be read visually, workers sometimes miss them while typing, and since the system operates on live sign language, there is no support for workers who miss signs.
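The selection step can be sketched as a simple plurality vote (an illustrative reconstruction, not the implementation in [7]; the function name and data layout are ours):

```typescript
// Each evaluator votes for the index of one candidate text; the candidate
// with the most votes represents the segment.
function selectByVote(candidates: string[], votes: number[]): string {
  const counts = new Array(candidates.length).fill(0);
  for (const v of votes) counts[v]++;
  let best = 0;
  for (let i = 1; i < counts.length; i++) {
    if (counts[i] > counts[best]) best = i;
  }
  return candidates[best];
}

// Three workers' translations of one segment; three evaluators vote by index.
console.log(selectByVote(["text A", "text B", "text C"], [1, 1, 2])); // "text B"
```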

Four studies on the expansion and improvement of this system have been reported.

The first method swaps workers between groups to keep the number of workers in each group equal as workers join or leave, while placing as little stress on the workers as possible [8]. The second method swaps workers between groups to balance the workers' skills as they join or leave [9]. In the third method, a group of workers performs the interpretation task and, if a result is incomplete, the missing words are filled in by referring to other workers' translations [10]. The fourth method assigns tasks commensurate with each worker's ability while keeping the number of tasks assigned to each worker as even as possible [11]. However, none of these studies reduce the workload of the worker who creates the segments or prevent workers from overlooking signs while typing.

3 Captioning System

3.1 System Structure

First, our system divided a live video of a sign language speaker into shorter segments of equal length and distributed them to workers via crowdsourcing. Second, a worker interpreted the sign language in a given segment and typed the corresponding text as a crowdsourced task. Finally, the system generated captions by integrating the texts created by the workers. Workers were placed in a queue to accept interpretation tasks, and workers who finished a task were requeued. Therefore, faster workers received more segments.
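The assignment mechanism can be sketched as a simple FIFO worker queue (a minimal sketch of the behavior described above; the identifiers and the assignTask placeholder are ours, not the actual implementation):

```typescript
interface Segment { index: number; start: number; end: number }
interface Worker { id: string; busy: boolean }

const workerQueue: Worker[] = [];      // idle workers waiting for a task
const pendingSegments: Segment[] = []; // segments not yet assigned

function enqueueWorker(w: Worker): void {
  w.busy = false;
  workerQueue.push(w); // re-queued workers go to the back of the queue
  dispatch();
}

function enqueueSegment(s: Segment): void {
  pendingSegments.push(s);
  dispatch();
}

// Assign pending segments to idle workers in FIFO order. Faster workers
// are re-queued sooner and therefore receive more segments.
function dispatch(): void {
  while (workerQueue.length > 0 && pendingSegments.length > 0) {
    const worker = workerQueue.shift()!;
    const segment = pendingSegments.shift()!;
    worker.busy = true;
    assignTask(worker, segment); // e.g., notify the worker's client
  }
}

declare function assignTask(worker: Worker, segment: Segment): void;
```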

Shorter segments of sign language video made it easier for workers to produce text, even if they were not adept at typing. Furthermore, we provided a user interface for playback speed control and one-second rewinding to make the tasks easier to complete.

Sign language video segments were automatically cut to equal lengths within our system, and the task for each segment was assigned to a different worker. Therefore, to avoid missing text between segments, the system overlapped adjacent segments by one second. Furthermore, the worker's task environment showed the live typing progress of the workers assigned to the previous and next segments. This allowed workers to collaborate, and we could expect natural connections between segments.
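With this design, each new segment starts before the previous one ends. A minimal sketch of the segment timing (the one-second overlap is taken from the description above; the 9-s segment length is the value used in our experiment, and the function itself is illustrative):

```typescript
// Equal-length segments that overlap their neighbors by one second.
function segmentBounds(index: number, segmentLength: number, overlap = 1) {
  const start = index * (segmentLength - overlap); // stride = length - overlap
  return { start, end: start + segmentLength };    // times in seconds
}

// With 9-s segments and a 1-s overlap, segment 1 starts at 8 s, while
// segment 0 is still running: { start: 0, end: 9 }, { start: 8, end: 17 }, ...
console.log(segmentBounds(0, 9), segmentBounds(1, 9));
```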

The system architecture was built around YouTube Live [12] and three web pages: a task control page, a task execution page, and a caption display page (see Fig. 1). The task control page divided the speaker's live sign language video into segments and assigned tasks to workers. The task execution page provided a user interface for interpreting and typing sign language while the worker watched the speaker's live sign language video. The caption display page showed the texts produced by the workers on screen in task order. The server was implemented on Node.js [13]. The client was implemented as a web page and used the YouTube Player API [14] to display and control the video.
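On the client, the video display and control functions map directly onto the YouTube Player API. A minimal sketch (the element and video IDs are placeholders, and the wrapper functions are ours, not the actual client code):

```typescript
// Sketch of client-side video control with the YouTube Player API.
// Assumes the IFrame API script and a <div id="player"> are already on the page.
declare const YT: any; // provided by https://www.youtube.com/iframe_api

const player = new YT.Player("player", {
  videoId: "LIVE_VIDEO_ID", // placeholder for the live stream ID
  events: {
    onReady: () => player.setPlaybackRate(1), // normal speed by default
  },
});

// Controls used by the task pages:
function rewindOneSecond(): void {
  player.seekTo(player.getCurrentTime() - 1, true); // 1-s rewind
}

function setChaseSpeed(): void {
  player.setPlaybackRate(1.5); // fast-forward to catch up with the live video
}
```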

Fig. 1. A real-time sign language-to-text interpretation system architecture

3.2 Task Control Page

When the speaker clicked the "Start" button on the task control page, the system began dividing the sign language video into segments of a predetermined length and assigning tasks to workers (see Fig. 2). At that time, the video playback time of the workers' interfaces was synchronized with the video playback time of the task control page. All segments were set to equal length. In this system, the playback time of the task control page served as the reference for real (live) time. Workers in the worker queue watched the video in real time while waiting. When a task became available, a worker was dequeued from the worker queue, assigned the task, and began working on it.
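The synchronization can be sketched as the control page periodically broadcasting its playback time, which each worker page uses to compute its own delay (our reconstruction, assuming a WebSocket-style channel and a simple JSON message; the paper does not specify the transport or message format):

```typescript
declare const player: { getCurrentTime(): number }; // YouTube player wrapper

const ws = new WebSocket("wss://example.org/sync"); // placeholder URL

// Task control page: broadcast the reference (live) time once per second.
setInterval(() => {
  ws.send(JSON.stringify({ type: "referenceTime", t: player.getCurrentTime() }));
}, 1000);

// Worker page: compute how far this worker lags behind the live reference.
let delaySeconds = 0;
ws.onmessage = (event: MessageEvent) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "referenceTime") {
    delaySeconds = msg.t - player.getCurrentTime();
  }
};
```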

Fig. 2. Task control page structure

3.3 Task Execution Page

The task execution page provided a user interface that helped the worker interpret and type the sign language (see Fig. 3). When assigned a task, the worker typed in the current worker's text field while watching the speaker's sign language video. The fields for the previous and next segments displayed the live typing progress of the neighboring workers and allowed the worker to spot missing or duplicated characters between segments. Because workers alternated between reading the sign language video and typing text, they sometimes overlooked the speaker's signs. To address this, the interface provided buttons and shortcut keys for video control: 1-s rewind, pause, and playback speed control.

Instructions on what the worker should do were displayed in the instruction frame. There were three types of instructions: "Start your sign language to text task," "Fast forward and wait until you make up for the delay," and "Wait for next task." The first was displayed when the worker was assigned a task, after which the worker started interpreting and typing the speaker's sign language. The second was displayed when the worker completed the task: because workers used the video control functions, they fell behind the speaker's live signing, and they made up for this delay with fast-forward playback, referenced to the playback time of the task control page, so that they could still follow the context. The third was displayed once the worker had caught up, and the worker then waited until the next task was assigned.
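The instruction shown thus depends on the task state and on how far the worker lags behind the live reference. A minimal sketch of this logic (our reconstruction, not the actual code; the delay value would come from the synchronization with the task control page):

```typescript
// Choose which instruction to show in the instruction frame.
type Instruction =
  | "Start your sign language to text task"
  | "Fast forward and wait until you make up for the delay"
  | "Wait for next task";

function chooseInstruction(hasTask: boolean, taskDone: boolean, delaySeconds: number): Instruction {
  if (hasTask && !taskDone) {
    return "Start your sign language to text task";                  // work on the segment
  }
  if (delaySeconds > 0) {
    return "Fast forward and wait until you make up for the delay";  // play at 1.5x
  }
  return "Wait for next task";                                       // caught up; wait in the queue
}
```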

Fig. 3. Task execution page for sign language-to-text interpretation.

4 Experiments and Results

We conducted a test using our prototype system for sign language-to-text interpretation. The workers were four DHH university students who could read sign language and had been using sign language for over 7 years. The video used for interpretation was 6 min, 54 s in length, and each segment was set to 9 s. The 9-s length was based on the mean segment delimitation time reported in a previous study [7]. When a worker was given the instruction "Fast forward and wait until you make up for the delay," the playback speed was set to 1.5×. Before the test, the typing speed of each worker was measured. We analyzed the workers' behavior logs in the task environment and their responses to questionnaires.

The mean Japanese character typing speed of the workers was 1.9 CPS (characters per second; SD 0.3 CPS), with the highest and lowest being 2.4 CPS and 1.6 CPS, respectively (see Table 1). The mean time it took workers to finish a task was 26.0 s (SD 13.1 s). The rates of missing text and of collisions between segments were both 33%. The mean number of tasks assigned to a worker was 12.3 (SD 5.4), with the highest and lowest being 21 and 7, respectively (see Table 2).

Table 1. Data of workers (mc is the mean typing speed of the workers; ms and σs are the mean and standard deviation of the workers' task completion times.)
Table 2. Task execution count and subjective evaluation (nt is the number of tasks executed.)

Workers were asked in a questionnaire about the enjoyability of the task, with 1 = negative and 5 = positive. The responses were one "2 = weak negative," two "3 = neither," and one "4 = weak positive" (see Table 2). Only one worker reported checking the live typing progress of the previous and next segments to ensure a good connection between the segments.

“Ref. of next and prev. segment” was measured by the response to “Did you check the live typing progress in the previous and next segments?” “Enjoyment of task” was measured by the response to “Did you have a good time?”

5 Discussion

The mean time it took a worker to finish a task was 26 s for a 9-s segment. In other words, workers took approximately three times the segment length to finish the interpretation task. Analysis of the workers' behavior logs showed that slow typing was a major factor in this time: workers with good typing skills finished tasks faster, whereas those with poor skills finished more slowly. In the future, we would like to experiment with methods that dynamically control the task duration so that each worker's delay matches the working memory retention time.

To reduce missing text between segments, we overlapped adjacent segments by one second and allowed workers to check the live typing progress of the previous and next segments. However, the combined rate of missing text and collisions between segments was 66%. According to the questionnaire, only one worker checked the live typing progress of the previous and next segments. We consider that non-expert workers did not have time to check for missing or colliding text between segments while simultaneously interpreting sign language into text. In future studies, we would like to reduce these problems by adding a new worker task that revises the results for missing text, collisions between segments, and errors. In addition, we would like to evaluate the quality of the generated captions to demonstrate the feasibility of a sign language-to-text interpretation system.

Because only four workers participated in this experiment, all workers were immediately assigned new tasks after being requeued. As a result, there was an approximately three-fold difference in the number of finished tasks between the fastest and slowest workers. Analysis of the questionnaire responses found that workers assigned fewer tasks considered the tasks more enjoyable. To determine the workforce required to keep the tasks enjoyable, we estimated the number of workers our system needs, based on the slowest worker. The required number of workers Nw can be expressed as the sum of the time Tt that a worker spends on the task and the time Tc spent on the chasing playback afterwards, divided by the segment time Ts:

$$ Nw = \left( Tt + Tc \right) / Ts $$
(1)

where Tc = (Tt − Ts)/(Rc − Rs), Rc is the chasing playback speed, and Rs is the standard playback speed, i.e., 1×. Based on the slowest worker (Tt = 45.2 s, Ts = 8 s because of the 1-s overlap, Rc = 1.5×), Nw = 14.85, so about 15 workers are needed in the worst case. That is, 15 workers would be required to reduce the workload and increase enjoyment. In the future, we would like to evaluate workers' impressions by measuring the System Usability Scale of the worker interface.
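As a minimal numeric check of Eq. (1) using the rounded values reported above (small differences from the figure in the text can arise from rounding of the mean task time):

```typescript
// Arithmetic check of Eq. (1) for the slowest worker; the inputs are the
// values reported above, not new measurements.
const Tt = 45.2;                  // mean task time of the slowest worker [s]
const Ts = 8;                     // effective segment time [s] (9-s segment minus 1-s overlap)
const Rc = 1.5;                   // chasing playback speed
const Rs = 1.0;                   // standard playback speed

const Tc = (Tt - Ts) / (Rc - Rs); // time to catch up with the live video: 74.4 s
const Nw = (Tt + Tc) / Ts;        // ≈ 15 workers in the worst case

console.log({ Tc, Nw });
```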

6 Conclusion

We developed a system that interprets sign language into caption text via crowdsourcing, with non-experts performing the interpretation, and conducted a test using our prototype system for sign language-to-text interpretation.

The test yielded three findings: first, the workers' slow typing was a major cause of delays in the workflow; second, the non-expert workers did not have time to check for missing or colliding text between segments while interpreting sign language into text; and finally, about 15 workers would be required to reduce the workload and increase the workers' enjoyment.

In the future, we would like to achieve four things. First, we would like to experiment with a method that dynamically controls the task duration so that each worker's delay matches the working memory retention time; second, we would like to mitigate the problems above by adding a new worker task that revises missing text, collisions between segments, and errors in the task results; third, we would like to evaluate the quality of the generated captions to demonstrate the feasibility of a sign language-to-text interpretation system; and finally, we would like to evaluate worker impressions by measuring the System Usability Scale of the worker interface.