
1 Introduction

Cybersecurity is a field that continually garners interest because it now pervades people's everyday lives. To educate the public and train prospective security specialists, academia and industry have responded by developing new programs in information assurance and devising creative ways to attract and train the next generation of cybersecurity professionals [1,2,3]. One such means has been to create cybersecurity competitions [4]. These competitions serve to educate participants and spectators alike. However, the number of student participants engaged in these competitions remains relatively small. One reason for this lack of interest may be that, to date, the competitions have benefited mainly the direct participants, such as student teams, cybersecurity experts, and the administrators and judges of the games. For an audience watching the event, the excitement and educational value do not necessarily transfer. This may be because many audience members are not experts in the field, and the information and visualization tools are pitched at too high a level for them to understand.

To help audiences comprehend cybersecurity competitions and encourage their engagement at these events, many educators use a video game format [5] to educate participants. Researchers have observed that audiences become engaged in the activity and gain educational content vicariously [2, 6, 7].

Fig. 8.1 Logical view of the LUCID visualization system

Considering the effectiveness of applying video games in education, this paper presents a system to build a game-like virtual commentator that aims to help non-expert audiences comprehend the concepts and to engage audiences in cybersecurity-related events, thereby promoting cybersecurity education among non-expert audiences. Specifically, the commentator is able to: (1) perform various human-like behaviors, ranging from valenced facial expressions to gestures, gaze, and emotions conveyed in speech; (2) interact with audiences, providing several types of feedback, including causal, congratulatory, deleterious, assistive, background, and motivational responses; and (3) produce speech in which the lips, tongue, and jaw provide visual cues that complement the auditory cues. In addition, the virtual commentator is flexible enough to be employed in the Collegiate Cyber Defense Competitions environment.

This research is supported mainly by NSF through the project “LUCID: A Spectator Targeted Visualization System to Broaden Participation at Cyber Defense Competitions”. The proposed virtual commentator is built as part of the LUCID framework shown in Fig. 8.1. The goal of this project was to stimulate interest in cybersecurity competitions through the development of a visualization and broadcast system targeted at enhancing learning and understanding among spectators. The LUCID framework comprises five basic subsystems: a computer and networking data capture, analysis, and storage subsystem; a computer and networking visualization subsystem; a camera and audio selection subsystem; a broadcast subsystem; and a display subsystem. Each subsystem collects its respective data and, in concert with the others, relays the necessary information to the spectator [2].

The animated commentator subsystem, highlighted by the red box in Fig. 8.2, takes video, image, and audio information, as well as computer and network visualization information, and generates semantic tags that control behaviors ranging from the facial expressions to the gestural motion of the animated agent, making it believable, personable, and emotional. These behaviors are broadcast and displayed by the camera and audio subsystem.

Fig. 8.2 Animated commentator subsystem architecture

The rest of this paper is organized as follows: Sect. 8.2 describes the architecture of the commentator subsystem. Section 8.3 presents some preliminary development results. Section 8.4 reviews related work. Section 8.5 concludes the paper.

2 The Commentator Subsystem

As stated earlier, the commentator subsystem in the LUCID framework aims to engage spectators in a fashion similar to video games [8]. The commentator subsystem focuses on the visual and auditory components of a virtual environment.

2.1 Virtual Commentary Architecture

The architecture of the animated commentator subsystem is shown in Fig. 8.2.

The core of the architecture consists of three components:

  • An integrated hardware and software interface for acquiring motion data from a variety of hardware devices, including the Kinect v1 camera and the VMG 30 data glove.

  • A software system to build a fully rigged behavioral model for the commentator based on the motion data.

  • Animation software to combine face, body, and hand tags for the agent’s behavioral models (a minimal sketch of this combination step appears after this list).
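To make the combination step concrete, the following minimal sketch merges separately authored face, body, and hand tag channels into a single time-ordered behavior track; the tag and channel names are hypothetical stand-ins, and in the actual system this merge happens inside the animation software.

```python
import heapq
from dataclasses import dataclass

@dataclass(order=True)
class BehaviorTag:
    time: float    # seconds from the start of the clip
    channel: str   # "face", "body", or "hand"
    tag: str       # e.g. "FET=SMILE", "HGT=(2,0,0)"

def merge_channels(*channels):
    """Merge per-channel tag lists into one time-ordered behavior track."""
    return list(heapq.merge(*(sorted(c) for c in channels)))

# Hypothetical per-channel tags
face = [BehaviorTag(0.2, "face", "FET=SMILE"), BehaviorTag(1.0, "face", "FET=RAISEEYEBROW")]
body = [BehaviorTag(0.6, "body", "HGT=(-2,0,0)")]
hand = [BehaviorTag(0.9, "hand", "GESTURE=POINT")]

for t in merge_channels(face, body, hand):
    print(f"{t.time:4.1f} s  {t.channel:<5}  {t.tag}")
```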

Fig. 8.3 Annotated video clip using ELAN and the associated annotation tiers

To model the different affective states exhibited by the most engaging sportscasters, we developed a system that generates images in the 3-D scene in two steps. First, in the modeling step, the system produces a precise description of the agent in terms of graphics primitives and acquires the vertex data required for those primitives from 3-D modeling software. The data provided for the development of the animated agent contain over one thousand points in 3-D space. Second, in the rendering step, the vertex data serve to draw the model to the screen.
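As an illustration of how the vertex data might be acquired from the modeling software, the short sketch below reads vertex positions from an OBJ-style text export; the file name and format are assumptions, since any export that yields per-vertex coordinates would serve.

```python
def load_obj_vertices(path):
    """Collect the 3-D vertex positions ('v x y z' lines) from an OBJ text export."""
    vertices = []
    with open(path) as f:
        for line in f:
            if line.startswith("v "):              # vertex position line
                _, x, y, z = line.split()[:4]
                vertices.append((float(x), float(y), float(z)))
    return vertices

# e.g. verts = load_obj_vertices("commentator.obj")  # hypothetical file; > 1000 points for our agent
```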

For the system to simulate a smooth transition in the model’s facial expression, as well as different full-body gestures, our method generates a sequence of vertices given that the only known vertices are the start and end vertices. By studying and annotating the video clips we collected, we can model realistic animation patterns. With the help of third-party applications, the animation is natural-looking and includes head and face movements, combined with the movements of facial components (eyes, eyebrows, nose, cheeks, etc.) and full-body movements (arms, hands, fingers, torso, legs, etc.).
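A minimal sketch of this in-between generation, assuming only that the start and end vertex sets are known: each intermediate frame interpolates every vertex, with an ease-in/ease-out weight (our illustrative choice) to keep the transition smooth.

```python
def smoothstep(t):
    """Ease-in/ease-out weight for t in [0, 1]."""
    return t * t * (3.0 - 2.0 * t)

def interpolate_frames(start_verts, end_verts, num_frames):
    """Generate num_frames vertex sets transitioning from start_verts to end_verts."""
    frames = []
    for i in range(num_frames):
        w = smoothstep(i / (num_frames - 1)) if num_frames > 1 else 1.0
        frame = [
            tuple(s + w * (e - s) for s, e in zip(sv, ev))
            for sv, ev in zip(start_verts, end_verts)
        ]
        frames.append(frame)
    return frames

# Example: blend a neutral mouth-corner vertex toward a smile pose over 10 frames
neutral = [(0.0, 0.0, 0.0)]
smile   = [(0.4, 0.3, 0.0)]
for f in interpolate_frames(neutral, smile, 10):
    print(f[0])
```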

2.2 Motion Generation

For the agent to simulate human commentator behavior, we initially used video footage of various contact-sports commentators. Sample footage came from NFL roundtable discussions, Super Bowl commentators, and Apollo Robbins’ TED talks. Each video file is roughly 9 min in length and contains commentator motion data that can be applied to the animated agent. The videos used in this project were sourced from YouTube.com. To further analyze the behaviors of commentators at non-contact events, especially cybersecurity events, we collected data from the Maryland Cyber Challenge. In 2014, we attended one such event and interviewed (and recorded) one of the sponsors of the competition. The interview lasted 30 min, and from it we created frame-by-frame shots to aid the analysis of facial and body gestures. In 2015, we performed a similar data collection, which provided us with 1 h and 30 min of data.

We observed several motions/gestures across all the collected data. To annotate the footage, we used a tool called ELAN [9]. With ELAN, we created three tiers to represent the different commentator behaviors we analyzed, and we added a fourth tier to represent the dialog phase of the interaction among the interviewer, spectators, and event sponsors. Figure 8.3 shows the ELAN interface and the different tiers, which allow the user to analyze media corpora by adding descriptive tags that are either time-aligned to the media or refer to other existing annotations.
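Because ELAN stores its annotations as EAF (XML) files, the tagged tiers can be read back into the pipeline programmatically. The sketch below is a rough reading of that structure (the file name is hypothetical) that collects the time-aligned annotation values for each tier.

```python
import xml.etree.ElementTree as ET

def read_eaf_tiers(path):
    """Return {tier_id: [(start_ms, end_ms, value), ...]} from an ELAN .eaf file."""
    root = ET.parse(path).getroot()
    # Map time-slot ids to their millisecond values
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE", 0))
             for ts in root.iter("TIME_SLOT")}
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
        tiers[tier.get("TIER_ID")] = rows
    return tiers

# e.g. tiers = read_eaf_tiers("mdc2015_interview.eaf")  # hypothetical file name
```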

From the analyzed video clips, we observed typical animation patterns for commentators with individual personalities. We represent these animation patterns using tagged meta-animation sequences. Based on the expression and gestural behaviors of interest, we developed tagged animation sequences that correspond to the behavior tiers. To capture the requisite gestures and facial expressions used by sports commentators, we used several hardware devices and software applications: Microsoft’s Kinect v2 captured face and body motion data, while the virtual motion glove (VMG 30) captured the relevant hand motion. To process the data streams from the Kinect v2 and the VMG 30, we used Brekel and the VMG SDK to map the input motion data to the animated model.

2.3 Speech Generation

The animated agent is required to give commentary on the cybersecurity competition while maintaining human-like expressions, hand gestures, and movements. The facial expression demonstrated by the agent depends on the sentences being narrated. Making this work seamlessly for the different scenarios that may arise during the competition involves two parts: first, a database that provides the appropriate triggers, in the form of text, for the agent to speak under different conditions; and second, a system that converts this text into speech while using the appropriate facial expressions.

The database used in this system is MySQL. During the competition, events, mainly services going up or down (up indicating that a service is still running, and down indicating that it has been compromised and shut down), cause data to be stored in the MySQL database. These data contain the keywords that the agent needs to narrate. We created scripts that query this database whenever the virtual agent needs to announce information to the spectators.
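A minimal sketch of such a query script, assuming a hypothetical service_status table with service_name and status columns (the actual schema produced by the monitoring subsystem may differ) and the standard mysql-connector-python driver:

```python
import mysql.connector  # pip install mysql-connector-python

def fetch_service_events(host="localhost", user="lucid", password="", database="lucid"):
    """Return (service, status) rows that the commentator should announce."""
    conn = mysql.connector.connect(host=host, user=user,
                                   password=password, database=database)
    try:
        cur = conn.cursor()
        # Hypothetical schema: one row per monitored service with its latest status
        cur.execute("SELECT service_name, status FROM service_status")
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for service, status in fetch_service_events():
        phrase = "is still up and running" if status == "up" else "has gone down"
        print(f"Team service {service} {phrase}.")
```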

The second part of the system is the text-to-speech (TTS) component. Here we used several third-party systems to provide speech functionality and to ensure that the text spoken by the agent also produces appropriate lip movements and facial expressions, yielding a full and realistic virtual commentator. We use two separate pieces of software for text-to-speech because of the different advantages they provide. MaryTTS [10, 11] is an open-source, multilingual text-to-speech synthesis platform written in Java; being open source, it is freely available online. We chose it for its availability, user-friendliness, and ease of use. The result of the database query, i.e., the text, is fed into MaryTTS to generate the phonemes contained in each word. A phoneme is the basic unit of sound of a word in a given language, and phonemes determine the way the agent’s lips move: factors such as accent, mouth shape, and language all affect how the mouth and lips move when words are pronounced. The generated phoneme list is then used to drive the mouth shapes of the agent when pronouncing each word. The second piece of software is CereVoice [12, 13], another TTS system, which is available through different license tiers. We use CereVoice because it produces much more natural-sounding synthesized speech than MaryTTS [10, 11]; it takes the same text as input and is responsible for producing the final audio heard by the spectators. To tie the model’s speech capabilities to its motion, we use the LipSync Pro system [14], which provides a window for synchronizing the generated phonemes, emotions, and gestures to dialogue in an audio file, and a component for playing back these dialogue clips on a character using fully customizable poses.
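As an illustration of the phoneme-generation step, the sketch below queries a locally running MaryTTS server through its default HTTP interface and collects the per-token phoneme strings from the returned MaryXML; the endpoint, parameters, and parsing are a rough sketch rather than a definitive client.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

MARY_URL = "http://localhost:59125/process"  # default MaryTTS HTTP endpoint (server assumed running)

def text_to_phonemes(text, locale="en_US"):
    """Request MaryTTS's PHONEMES output and collect each token's phoneme string."""
    params = urllib.parse.urlencode({
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "PHONEMES",
        "LOCALE": locale,
    })
    with urllib.request.urlopen(f"{MARY_URL}?{params}") as resp:
        maryxml = resp.read()
    root = ET.fromstring(maryxml)
    # MaryXML tokens (<t> elements) carry their phonemisation in the "ph" attribute
    return [el.get("ph") for el in root.iter()
            if el.tag.split("}")[-1] == "t" and el.get("ph")]

# e.g. text_to_phonemes("didn't know you know so much")
```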

3 Preliminary Results

Our preliminary results include: (1) the expressions/gestures developed for an animated virtual commentator and (2) a generated marked-up animation sequence.

3.1 Generating Expressions and Gestures

Figure 8.4 shows some of the expressions and gestures of the animated virtual commentator we have developed to engage spectators in the Collegiate Cyber Defense Competitions (CCDCs) environment. By recording and observing the interactional behaviors of different professional human commentators working with various groups of spectators at a set of CCDCs, we identified specific coordinated verbal and nonverbal signals exchanged during these sessions. The nonverbal cues include head nods, the direction of eye gaze (including eye contact), facial expressions, posture shifts, and hand and arm gestures. Once these had been codified and analyzed to determine their temporal organization and contexts of occurrence, we conducted several experiments, noted the common behaviors exhibited by the observed commentators, and mapped those behaviors onto the animated agent.

Fig. 8.4 Diverse expressions of an emotional embodied conversational agent. Top row: smile with teeth, smile with closed mouth, and sadness. Middle row: surprise, anger, and disgust. Bottom row: fear, pursed lips, and pursed lips with closed eyes

The current interface, while independent of other external systems, is shown in Fig. 8.5, which contains the current virtual environment for one of the virtual commentators created for the system. The scene is a 3D mock-up of the bridge of the starship Enterprise from the popular science fiction show Star Trek™: The Next Generation. On initialization, the agent checks whether it has established a connection to the MySQL database that houses the data collected from the network monitoring system. If there is a connection issue, the agent reports it to the spectators with randomly selected, over-emphasized gestures and matching generated speech, as seen in Fig. 8.6.

Fig. 8.5 System interface with a virtual agent in an idle animation

Fig. 8.6 Top: model behavior for a successful MySQL connection, showing an exaggerated thumbs-up gesture and returning to the idle animation. Bottom: model behavior for a failed MySQL connection, showing disappointment by shrugging the shoulders
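A rough sketch of this start-up check and the reactions shown in Fig. 8.6; the gesture clip names and the play_gesture/announce callbacks are hypothetical stand-ins for the animation and TTS triggers on the agent side.

```python
import random
import mysql.connector  # pip install mysql-connector-python

SUCCESS_GESTURES = ["thumbs_up_exaggerated", "fist_pump"]          # hypothetical clip names
FAILURE_GESTURES = ["shoulder_shrug", "disappointed_head_shake"]   # hypothetical clip names

def startup_check(play_gesture, announce, **db_args):
    """On launch, verify the scoring database is reachable and react in character."""
    try:
        mysql.connector.connect(**db_args).close()
    except mysql.connector.Error:
        # Failure: exaggerated disappointment plus matching speech (Fig. 8.6, bottom)
        play_gesture(random.choice(FAILURE_GESTURES))
        announce("I can't reach the competition data right now, folks. Stay tuned!")
        return False
    # Success: exaggerated thumbs up, then back to the idle animation (Fig. 8.6, top)
    play_gesture(random.choice(SUCCESS_GESTURES))
    announce("We're connected and ready to bring you the action!")
    return True

if __name__ == "__main__":
    # Stand-in callbacks: in the real system these trigger animation clips and TTS
    startup_check(print, print, host="localhost", user="lucid",
                  password="", database="lucid")
```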

3.2 Generating a Marked-up Animation Sequence

To generate a marked-up animation sequence, or to add additional markup to a sequence, we insert a series of animation tags into the phoneme string.

Here is an example of how this works, as seen in Table 8.1. Suppose the input text string is the prompt “didn’t know you know so much,” with the associated phoneme string and animation sequence, and that a meta-animation sequence will be imposed on this animation sequence automatically.

Table 8.1 Phoneme generation with time stamps and no behavioral tags

Using, for example, the meta-animation sequence 0.2 s—<FET = SMILE>—0.2 s—<EYEB = (3000, B)>—0.2 s—<HGT = (−2,0,0)>—0.3 s—<HGT = (2,0,0)> <FET = RAISEEYEBROW>—0.3 s—<HGT = (4,0,2)>—0.3 s—<FET = SMILE>: the smile target is inserted on a phoneme boundary close to 0.2 s into the animation, as specified by the meta-animation sequence, but not at exactly 0.2 s, allowing semi-random behavior and ensuring temporal variability while keeping the behavior synchronized with speech. The meta-animation sequence may alternatively specify that the tag be inserted only at a word boundary, probabilistically, or in other ways.
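The sketch below illustrates this insertion rule: each meta-animation tag carries a desired offset and is attached to the nearest phoneme boundary rather than at the exact time (the phoneme timings here are hypothetical, and jitter or word-boundary rules could be substituted for the nearest-boundary choice).

```python
def insert_tags(phonemes, meta_tags):
    """
    phonemes:  list of (start_time, phoneme) pairs, in order
    meta_tags: list of (desired_offset, tag) pairs, e.g. (0.2, "<FET=SMILE>")
    Returns the animation sequence with each tag snapped to the nearest phoneme boundary.
    """
    boundaries = [t for t, _ in phonemes]
    placed = {}
    for offset, tag in meta_tags:
        nearest = min(boundaries, key=lambda b: abs(b - offset))
        placed.setdefault(nearest, []).append(tag)
    sequence = []
    for start, ph in phonemes:
        sequence.extend(placed.get(start, []))   # tags fire at this boundary
        sequence.append((start, ph))
    return sequence

# Hypothetical phoneme timings for the start of "didn't know you know so much"
phonemes = [(0.00, "d"), (0.07, "ih"), (0.12, "d"), (0.18, "n"), (0.25, "t"),
            (0.32, "n"), (0.40, "ow"), (0.55, "y"), (0.62, "uw")]
meta = [(0.2, "<FET=SMILE>"), (0.4, "<EYEB=(3000,B)>"), (0.6, "<HGT=(-2,0,0)>")]
for item in insert_tags(phonemes, meta):
    print(item)
```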

Tags in the meta-animation sequence are inserted in order, and the movements from all channels are then combined to generate the final animation sequence. For the previous example, the result may look like Table 8.2.

Table 8.2 Phoneme generation with desired effective behavioral tags

4 Related Work

Cybersecurity competitions are becoming an increasingly prominent part of the landscape of cybersecurity education. Most notable are the National Collegiate Cyber Defense Competition (NCCDC), with its precursor regional events [6], and the UCSB International Capture the Flag Competition (iCTF) [7]. These are both large national and international competitions involving scores of institutions and hundreds of students [4]. There are several other competitions held at regional levels [4], as well as those organized by other faculty and security specialists [7, 15,16,17,18,19,20,21].

As O’Leary noted [4], these competitions require of the students a certain level of competence in defensive and administrative skills. The teams of participating students work to defend identical networks from a group of designated attackers. The team that successfully keeps most of the services on its network, as well as the network itself, operational is declared the winner of the event.

Most of these games favor the war-game mode of hands-on exercises for participants. These war games are organized as capture-the-flag (CTF) [22, 23], king of the hill (KOTH), defend the flag (a variation of KOTH), computer simulations, or online programming-level war games [24, 25]. The issue with these modes is that there is no opportunity for the spectator to actively participate in the competition, given that spectators are prevented from engaging with the actual participants. The lack of a system that involves spectators in the event continues to unintentionally exclude a potential population of security specialists in a field that already suffers from reduced engagement from different groups. Hence, we propose the use of a pedagogically effective virtual agent with commentator status that engages the spectators.

5 Conclusion

This paper presents a virtual animated commentator architecture for cybersecurity competitions. We have developed a commentator based on this architecture; it is able to generate expressions/gestures and a marked-up animation sequence. Specifically, we have identified a set of verbal and nonverbal signals and cues exchanged during the interactions between spectators and the commentator. To that end, we take advantage of the capabilities of the Kinect v2 and the VMG 30 glove to model those behaviors and map them onto our 3D animated models to better simulate the realistic performance of human commentators. We anticipate that the system will ultimately improve the educational value and excitement for the spectator and broaden interest in the field of cybersecurity. Our future work is based on two hypotheses concerning the audience’s interaction with the system and the method used to educate and engage them. First, we hypothesize that spectators’ excitement and learning outcomes will improve with the ability to extract, visualize, and comprehend the details of the game as it unfolds [2], thus opening cybersecurity events to a broader group. Second, we hypothesize that the use of a pedagogically effective animated commentator will aid comprehension and engagement at these events.