
1 Introduction

Human poses and human motion are essential components of video footage, particularly in activities related to the performing arts. The tool explored in this paper combines multimodal annotation input/output with AI-based pose estimation to improve Human-Computer Interaction [1,2,3]. MotionNotes was developed as part of the EU-funded CultureMoves project [4], which follows a user-oriented approach and has the primary goal of developing software tools to “access and augment educational and cultural content” such as that contained in Europeana [5]. MotionNotes enables its users to record, replay, and add new information to video footage by combining multiple annotation types with machine learning algorithms, thereby increasing productivity and creativity.

AI techniques such as 2D human pose estimation have become increasingly reliable in recent years [6], and they can be applied with promising results in Human-Computer Interaction and multimodal systems.

Therefore, three main research questions were posed while developing and testing this Web-based tool as an input interface for adding and manipulating multimodal time-based annotations over video:

  1. Is it preferable to carry out annotation work during or after recording?

     a. Moreover, for each mode, what are the differences in annotation type usage?

  2. Regarding the user devices used for annotation work, is there a preference between laptop and mobile devices?

     a. Additionally, for each device type, what are the differences in annotation type usage?

  3. Could the human pose estimation feature be an asset to users when carrying out annotation work?

This work contributes a preliminary evaluation of the prototype and its interactions by answering these research questions, providing insights into users’ preferences. We collected feedback through questionnaires and informal interviews. From the sessions held in our lab, we concluded that users accepted the general idea of replacing their previous annotation methods with this web-based solution. Moreover, we can state that people who work with video annotation are receptive to exploring different tools and interactions.

This paper is structured as follows. We start by analysing the related work, followed by a description of MotionNotes. Afterwards, we present the testing environment and the results obtained. Finally, we conclude with a summary highlighting the tool’s potential and outline future work.

2 Related Work

Video annotation is a valuable resource in different application areas, including the analysis and study of human body motion. Furthermore, annotation tools are essential for encouraging collaborative teamwork by enabling information sharing [7]. These reasons have motivated the development of several tools over the years.

ELAN [8] is one of the best-known and most widely used tools for manually annotating or transcribing non-verbal communication. Goldman [9] explored video annotations combined with object-tracking methods; however, that work does not support touch or pen-based annotations, and its tracking could not run in real time. The Choreographer’s Notebook [10] was designed specifically for the choreography workflow, allowing digital-ink and text annotations. The WML tool [11] is another Web-based tool specifically designed to annotate, archive, and search dance movements.

In contrast, Cabral et al. [12] developed a pen-based video annotation tool to track motion. Their solution relied on frame differences, and they later applied similar methods to video editing [13]. Following the same path, Silva et al. [14] presented work that enables real-time object tracking for pen-based video annotations. Later, as part of the BlackBox project [15], a prototype was developed to experiment with annotations in a 3D environment using the Microsoft Kinect. Commercial video annotation applications, such as Wipster [16], Camtasia [17], Frame.io [18], and Vimeo Pro Review [19], have simplified the process of annotating and sharing videos. However, none of them supports automatic human pose detection.

Human pose estimation is a valuable computer vision technique in several areas, such as gaming, virtual reality, and video surveillance. It seeks to detect human body parts computationally from video frames, identifying landmarks such as the head, elbows, shoulders, knees, hips, and feet. Several approaches have been proposed over the years to address this problem.

By the end of the 2000s, the state of the art was based on algorithms using hand-crafted features selected by human specialists, such as gradient histograms [20,21,22]. Later, deep learning techniques attracted a great deal of attention across the AI community [23], and human pose estimation was no exception. Deep learning-based methods can extract more and better features from training data, and the literature reports superior results with them [24,25,26,27,28]. Our proposal explores implementations based on this latter approach.

3 MotionNotes

MotionNotes [29] is a web-based, real-time multimodal video annotation tool driven by keyboard, touch, and voice input. The tool supports professional and amateur users working on creative and exploratory processes. MotionNotes enables the capture of multimodal annotations both while recording video and afterwards. The available annotation types are text, ink strokes, audio, and user-configured marks.

3.1 MotionNotes Implementation Overview

The prototype was designed to run in any regular Web browser and to explore multiple input modes, such as keyboard and touch interaction. The interface is responsive so that users with different screen sizes enjoy adequate interaction. The MotionNotes user interface has a video display area in the centre of the screen, where it is possible to add new annotations or update existing ones. To improve user feedback, a graphical representation of all annotated moments is shown right below the video area (Fig. 1). Moreover, we included a machine learning technique in MotionNotes to perform real-time human pose predictions. PoseNet [30], a pre-trained neural network, is used in conjunction with TensorFlow.js to perform body-part classification on the client’s machine. The predicted points are drawn on an HTML canvas element positioned over the video on a higher layer. Finally, straight lines are drawn between the points so that the skeleton remains visible, giving the user another resource for identifying possible annotations (Fig. 2).
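
As a rough illustration of this overlay, the sketch below loads PoseNet through TensorFlow.js, estimates a single pose per frame, and draws keypoints and connecting lines on a canvas placed over the video. Element ids, colours, and the confidence threshold are our own assumptions rather than details of the MotionNotes implementation.

```javascript
// Minimal sketch of the pose overlay, assuming a <video id="annotVideo"> and an
// overlaid, same-sized <canvas id="poseCanvas">; ids, colours, and the threshold
// are illustrative, not taken from the MotionNotes source.
import '@tensorflow/tfjs';
import * as posenet from '@tensorflow-models/posenet';

const video = document.getElementById('annotVideo');
const canvas = document.getElementById('poseCanvas');
const ctx = canvas.getContext('2d');
const MIN_CONFIDENCE = 0.5; // assumed keypoint score threshold

async function runPoseOverlay() {
  const net = await posenet.load(); // pre-trained model, executed on the client

  async function renderFrame() {
    const pose = await net.estimateSinglePose(video, { flipHorizontal: false });
    ctx.clearRect(0, 0, canvas.width, canvas.height);

    // Draw each sufficiently confident keypoint.
    for (const kp of pose.keypoints) {
      if (kp.score < MIN_CONFIDENCE) continue;
      ctx.beginPath();
      ctx.arc(kp.position.x, kp.position.y, 4, 0, 2 * Math.PI);
      ctx.fillStyle = 'lime';
      ctx.fill();
    }

    // Connect adjacent keypoints with straight lines so the skeleton stays visible.
    for (const [a, b] of posenet.getAdjacentKeyPoints(pose.keypoints, MIN_CONFIDENCE)) {
      ctx.beginPath();
      ctx.moveTo(a.position.x, a.position.y);
      ctx.lineTo(b.position.x, b.position.y);
      ctx.strokeStyle = 'lime';
      ctx.lineWidth = 2;
      ctx.stroke();
    }

    requestAnimationFrame(renderFrame); // repeat for the next video frame
  }

  renderFrame();
}

runPoseOverlay();
```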

Fig. 1. Tool GUI and annotation types (Green: drawing; Red: text; Blue: mark) (Color figure online).

3.2 MotionNotes Interaction Example

After initial interactions and reflection on testing scenarios, an example procedure was created for MotionNotes. To better understand the interaction, we follow Mary while she annotates her video with MotionNotes.

First, Mary opens MotionNotes because she would like to load a video that she recorded at a recent dance competition. Mary goes to the File menu and clicks the import video option; MotionNotes immediately opens a new window. Mary browses and selects her video. Once done, she clicks the play button and the video starts playing. She notices that the right-hand movement could be better and feels the need to highlight this. Mary pauses the video and considers which annotation types are appropriate to express her thoughts. She decides to add a text annotation; to do so, she selects the text annotation type in the left menu and clicks on the preferred location over the video. MotionNotes immediately creates a new textbox and gives it focus. Next, Mary types “Right-hand position improvements were needed here” and presses Enter. MotionNotes saves several details about the annotation, for instance the text, font, colour, position over the video, and the exact timestamp. To give more detail, Mary decides that a drawing annotation could help. She clicks the draw annotation option and MotionNotes activates the drawing functionality; she then draws a line across the location where the arm and hand should be. Finally, Mary believes that this section is crucial for the performance and adds a mark annotation. For that, she selects the mark annotation, and MotionNotes opens a popup with the predefined images. She selects a key icon and concludes the procedure by clicking on the desired location over the video where the icon should be placed (Fig. 1).
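
To make the walkthrough concrete, the following sketch shows the kind of record each of Mary’s annotations might produce and how annotated moments could be grouped by timestamp for the timeline. Every field name and value here is an illustrative assumption; the actual MotionNotes data model is not described in this paper.

```javascript
// Illustrative records for Mary's annotations; schema and values are assumptions.
const textAnnotation = {
  type: 'text',
  content: 'Right-hand position improvements were needed here',
  font: '16px sans-serif',
  colour: '#ff0000',
  position: { x: 0.42, y: 0.31 }, // location over the video, relative coordinates
  timestamp: 73.4                 // seconds into the video
};

const markAnnotation = {
  type: 'mark',
  icon: 'key',                    // predefined icon chosen from the popup
  position: { x: 0.55, y: 0.40 },
  timestamp: 73.4
};

// Grouping annotations by timestamp lets the timeline below the video show every
// annotated moment.
function addAnnotation(store, annotation) {
  const atTime = store.get(annotation.timestamp) || [];
  atTime.push(annotation);
  store.set(annotation.timestamp, atTime);
}

const annotations = new Map();
addAnnotation(annotations, textAnnotation);
addAnnotation(annotations, markAnnotation);
```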

Fig. 2. MotionNotes with pose estimation (Green: pose predictions; Red: manual annotation) (Color figure online).

4 User Study

The study was composed of three phases, each addressing a different research question. The first phase focused on collecting data on the annotation experience both while recording and after recording. The second phase focused on understanding user behaviour on different devices. Finally, the last phase assessed whether human pose estimation could add value to video annotation software.

4.1 Design and Participants

We performed a user study with 27 participants. Each test started with the participant watching a 15-min tutorial. Participants then interacted with the software by completing a set of proposed tasks.

Regarding the participants, the most represented age interval was 25 to 34 years old. The gender split was nearly even, with 51% female and 49% male. Regarding education levels, 37% had a master’s degree, 29.6% had a bachelor’s degree, 18.5% had studied up to high school, and the remaining 14.8% held a PhD. Most participants reported that they frequently annotate their work in some way (77.8%). The preferred annotation method is a regular paper notebook (63.6%), with laptops being the second most popular option (36.4%) and mobile phones appearing right after (27.3%).

The questionnaire included 23 questions answered on a five-point Likert scale. We used paired-samples t-tests and one-way ANOVA to analyse the feedback.
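
For reference, the paired-samples t statistic used in this analysis can be computed as sketched below; the ratings in the example are invented for illustration and are not the study data.

```javascript
// Paired-samples t statistic for two ratings given by the same participants; the
// values below are made up for illustration and are not the study data.
function pairedTTest(a, b) {
  const n = a.length;
  const diffs = a.map((x, i) => x - b[i]);
  const mean = diffs.reduce((s, d) => s + d, 0) / n;
  const variance = diffs.reduce((s, d) => s + (d - mean) ** 2, 0) / (n - 1);
  return { t: mean / Math.sqrt(variance / n), df: n - 1 };
}

// Example: live-annotation vs. after-recording ratings from five participants.
const liveRatings = [3, 2, 4, 3, 2];
const afterRatings = [4, 4, 5, 4, 3];
console.log(pairedTTest(liveRatings, afterRatings)); // compare |t| with the critical value for df
```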

4.2 Results and Discussion

The three tables presented in this section summarise the results, including descriptive statistics, t-tests, and ANOVA for each of the research questions posed in the introduction.

Table 1 focuses on the results related to RQ1. Participants rated the annotation experience both during and after recording. The t-test returned a significant difference, showing a preference for using the tool in playback mode. Moreover, the ANOVA showed a significant difference within the after-recording mode, indicating that text was the most popular annotation type, followed by drawing.

Table 1. Descriptive statistics, t-test, and ANOVA for the different annotation type experience.

Regarding RQ2, which rates the experience on different devices, a mobile touch-based device (less than 576 px wide) was compared with a regular laptop. Again, the t-test returned a significant difference, showing a preference for using the tool on a regular laptop with its larger resolution. The ANOVA did not show significant differences in the usage of the different annotation types. However, text annotation is slightly more prevalent on laptops, while mark annotation leads on mobile. Table 2 summarises the results.

Table 2. Descriptive statistics, t-test, and ANOVA for the device interaction experience.

Regarding RQ3, which rates the overall experience with the human pose estimation feature active, the feedback was positive, as shown in Table 3. However, when users were asked whether they would consider using this feature during annotation work, the results were only moderate. Users’ comments about this feature revealed that they expected more options to work with it, such as recording only the pose, reproducing the movements in isolation (e.g., without video and sound), and adding annotations in that mode.

Table 3. Descriptive statistics, t-test for the pose estimation experience.

The user statements during the test were mostly positive. One user (U3) said: “Easy to learn; the multiple annotation types complement each other very well.” Another user (U15) stated: “The marks was a good idea, very fast to apply, even in small screen devices”. Regarding human pose estimation, another user (U24) stated the following: “There were scenarios where having the pose helped in the creation of new annotations”.

5 Future Work

The feedback obtained while testing MotionNotes was positive. In addition, we discussed and collected several new ideas for further development.

Participants mentioned a few scenarios in which they considered that human pose estimation could benefit MotionNotes in a future version. The first scenario, discussed by several participants, was background subtraction: reproducing the body-part motion in a skeleton format over the same timeframe as the source video, but on a clean background. Additionally, the skeleton colour, background colour, and audio should be personalisable. The second scenario was based on having a particular annotation type associated with the pose. This annotation type should be optional and could be activated or deactivated depending on the motion-tracking status. This scenario brings several advantages: users could concentrate only on the pose in a specific annotation iteration, leaving other elements for later; another advantage is movement correction, where users could edit several frames by drawing the correct pose.

Regarding the annotation types, the most discussed were the marks, in which participants showed great interest given their novelty. Again, we encouraged the participants to give suggestions and ideas, and 3D was the subject of debate for this annotation type. Currently, marks are predefined 2D icons or images uploaded by users, and participants commented on how interesting it would be to upload 3D models and add them to a scene as annotations. We think most of these ideas could foster users’ creativity, which is one of our main goals, and we are already designing a new MotionNotes version containing some of these features.

6 Conclusion

The multimodal Web video annotation tool MotionNotes described in this paper enables users to add different annotation types and identify human poses in real time. The tool was tested to address three main research questions concerning user preferences and interaction.

From our results, we can conclude that annotation software users prefer to work after the recording session rather than during it. Additionally, they preferred working in a traditional environment with a larger screen over the more modern and popular mobile devices. Finally, we observed notable curiosity about both automatic human pose recognition and the mark annotation type. Future work should focus on these features and provide additional research on them.