
1 Introduction

Collaboration is an important skill in the 21st century [10]. It takes place in different settings and for different purposes: collaborative meetings [17, 36, 38], collaborative problem solving [34], collaborative project work [7, 8], collaborative programming [15] and collaborative brainstorming [37]. Some of these settings are co-located and some are remote. “The requirement of successful collaboration is complex, multimodal, subtle, and learned over a lifetime. It involves discourse, gesture, gaze, cognition, social skills, tacit practices, etc.” [emphasis added] [35]. Moreover, the indicators of collaboration vary with context. For instance, in collaborative programming, pointing to the screen, grabbing the mouse from the partner and synchrony in body posture are relevant indicators of good collaboration [15]; whereas in collaborative meetings, gaze direction, body posture and speaking time of group members are more relevant indicators of collaboration quality [17, 36, 38]. Thus, it is essential to understand the different types of collaboration, their purposes and the relevant indicators. These indicators help to formulate the intervention or feedback mechanism that facilitates collaboration [2, 5, 30]. Moreover, merely engaging in a collaborative task does not necessarily build collaborative skills [12]; rather, timely feedback encourages self-reflection [23]. The type of feedback also depends on the goal of the task, which can be to evaluate collaboration as a process [2], collaboration as an outcome (indicated by learning gain) [30], or both [30]. To understand co-located collaboration (CC) in depth, we formulated two research questions:

RQ 1: What collaboration indicators can be observed and are relevant for the quality of collaboration during CC?

RQ 2: What are the state-of-the-art feedback mechanisms that are used during CC?

There has been a dearth of studies on automated multimodal analysis in non-computer-supported environments [40]. Considering the time and effort required to build a sensor-based automated system that can also give real-time feedback, we chose to create a Wizard of Oz (WOz) research prototype that integrates human observers and existing sensor technology. This enables us to study different CC settings with a variety of multi-source multimodal indicators coming from automated sensors as well as human observers.

The remainder of the paper is structured as follows: in the related work section (Sect. 2) we answer RQ 1 and RQ 2; this is followed by an explanation of our prototype design based on the WOz study (Sect. 3) and by a discussion (Sect. 4) of the answers to our research questions; finally, a conclusion (Sect. 5) is drawn and we outline future work and open questions.

2 Related Work

In this section, we first analyze related work on the different multimodal indicators used during CC, and then review the different feedback mechanisms used during CC.

2.1 Multimodal Indicators During Co-located Collaboration

Different categories of verbal and non-verbal indicators have been used in the literature to measure collaboration quality, ranging from tangible interaction and different speech-based cues to gaze and eye interaction. Schneider and Blikstein [27] used a Tangible User Interface (TUI) for pairs of students to predict learning gains by analyzing data from multimodal learning environments. They tracked gesture and posture using a Kinect sensor (Version 1), which can track the posture and gesture of a maximum of four students at a time based on their skeletal movements. They found that hand movements and posture movements (coded as active, semi-active and passive) are correlated with learning gains: the more active a student is, the higher the learning gain. Even the number of transitions between these three phases was a strong predictor of learning, and students who used both hands showed higher learning gains. Some of the activities logged by the TUI, like the frequency of opening the information box, also correlated with learning gain. All these features were fed into a supervised machine learning framework to predict learning gain. Similarly, Martinez-Maldonado et al. [21] used TUI indicators for group work based on the generated log data and the gesture and posture of group members around the TUI.
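For illustration only (this is not the authors' pipeline, and the feature values and labels below are invented), a supervised classifier over such multimodal features could be sketched in Python with scikit-learn as follows:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature rows: [mean hand-activity code, phase transitions, info-box opens]
X = [[1.8, 12, 5], [0.6, 3, 1], [1.2, 8, 4], [0.4, 2, 0]]
y = [1, 0, 1, 0]  # 1 = high learning gain, 0 = low (illustrative labels only)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([[1.5, 10, 3]]))  # predicted learning-gain class for a new dyad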

Other works detected non-verbal cues during collaboration without a TUI. Stiefelhagen and Zhu [36] tried to detect the impact of head orientation on gaze direction in a round-table meeting of four members. They found that, on average, head orientation can estimate gaze direction 68.9% of the time. Moreover, the attention focus of group members can be predicted 88.7% of the time using head orientation as the only input. Similarly, Cukurova et al. [7] performed an experiment with 18 participants in six groups of three to detect non-verbal cues of collaboration using human observation. Hand position (HP) and head direction (HD) were good predictors of competencies in Collaborative Problem Solving (CPS). They extended this work and formed the NISPI framework [8] using HP and HD as non-verbal indicators. These indicators were obtained while students (11–20 years old) designed a prototype using the Arduino toolkit. Each student was then coded as 2 (active) if interacting with the object for problem solving, 1 (semi-active) if the student's head was directed towards an active peer, and 0 (passive) for all other situations. Using this coding, different collaboration dimensions like synchrony, individual accountability (IA), equality and intra-individual variability (IIV) were formed. High CPS competence was detected when high levels of synchrony, IA and equality were observed in the groups.
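The exact metric definitions of NISPI are not reproduced here; the following minimal Python sketch only illustrates, with hypothetical proxy formulas, how per-student 0/1/2 activity codes could be aggregated into synchrony, individual accountability, equality and IIV values:

from statistics import mean, pstdev

def group_metrics(codes_per_student):
    """codes_per_student: dict name -> list of activity codes (0, 1 or 2),
    one code per observation window, all lists of equal length."""
    names = list(codes_per_student)
    n_windows = len(codes_per_student[names[0]])

    # Synchrony proxy: share of windows in which every student has the same code.
    synchrony = sum(
        len({codes_per_student[n][t] for n in names}) == 1 for t in range(n_windows)
    ) / n_windows

    # Individual accountability proxy: mean activity level per student.
    accountability = {n: mean(codes_per_student[n]) for n in names}

    # Equality proxy: low spread of mean activity across students -> high equality.
    equality = 1 - pstdev(list(accountability.values())) / 2  # codes span [0, 2]

    # IIV proxy: how much each student's activity fluctuates over time.
    iiv = {n: pstdev(codes_per_student[n]) for n in names}
    return synchrony, accountability, equality, iiv

print(group_metrics({"A": [2, 2, 1, 0], "B": [2, 1, 1, 0], "C": [2, 2, 1, 0]}))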

Speech-based cues are an integral part of any collaborative task. Lubold and Pon-Barry [19] found that proximity, convergence and synchrony are different types of coordination cues that can be obtained from the speech features (like intensity, pitch and jitter) of a pair of collaborating students, and that these cues help to detect rapport between group members. Correlation analysis showed that proximity, convergence and synchrony measured using pitch can be good predictors of rapport between group members during collaboration; students also self-reported rapport, which was compared against these measures to determine collaboration levels. Bassiou et al. [4] automatically assessed collaboration among students solving math problems. They used non-lexical speech features, thereby preserving privacy, together with a combination of manual annotation and a Support Vector Machine (SVM) to predict the collaboration quality of the group. The types of collaboration marked were: Good (all three members are working together and contributing to the discussion), Cold (only two members are working together), Follow (one leader is not integrating the whole group) and Not (everyone is working independently). This coding was based on two types of engagement: simple (talking and paying attention) and intellectual (actively engaged in the conversation). They found that the combination of speech-activity features (i.e., solo duration, overlap duration of two persons, overlap duration of all three persons) and speaker-based features (i.e., spectral, temporal, prosodic and tonal features of speech) are good predictors of collaboration. Simple indicators like the speaking time of each member can also be a good indicator of collaboration [2, 5]. Even a mixture of verbal and non-verbal indicators along with physiological signals like skin temperature [24] can be a good collaboration indicator [18, 20].
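As a rough illustration of the speech-activity features (not the actual pipeline of Bassiou et al. [4]), the following sketch derives solo and overlap durations from assumed per-speaker (start, end) speech intervals using a simple time-slicing approximation:

def speech_activity_features(intervals, step=0.1):
    """intervals: dict speaker -> list of (start, end) times in seconds.
    Returns total duration (s) with exactly 1, 2 and 3+ simultaneous speakers."""
    t_end = max(e for segs in intervals.values() for _, e in segs)
    solo = overlap2 = overlap3 = 0.0
    t = 0.0
    while t < t_end:  # coarse time slicing; step size is an assumption
        active = sum(
            any(s <= t < e for s, e in segs) for segs in intervals.values()
        )
        if active == 1:
            solo += step
        elif active == 2:
            overlap2 += step
        elif active >= 3:
            overlap3 += step
        t += step
    return solo, overlap2, overlap3

print(speech_activity_features({
    "A": [(0.0, 5.0), (9.0, 12.0)],
    "B": [(4.0, 8.0)],
    "C": [(7.0, 10.0)],
}))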

Besides, eye gaze can be an indicator of collaboration quality. Researchers using eye-gaze analysis [16, 25, 28] found that Joint Visual Attention (JVA), i.e., the proportion of time the gazes of individuals are aligned on the same area of a shared object or screen, is a good predictor of the quality of collaboration of a group, as reflected by the group's performance. Moreover, Schneider and Pea [28] showed that JVA can be used as a reflection mechanism in remote settings by showing each student their partner's gaze patterns in real-time to improve collaboration. Schneider et al. [30] obtained the same results by replicating the experiment in a co-located setting. Schneider and Pea [29] used JVA, network analysis and machine learning to determine different dimensions of good collaboration, like mutual understanding, dialogue management, division of task and signs of coordination, as outlined by Meier et al. [22].
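A minimal sketch of the JVA idea follows, assuming gaze is available as synchronized (x, y) fixation samples on a shared screen; the alignment radius is a hypothetical parameter, not a value taken from the cited studies:

import math

def joint_visual_attention(gaze_a, gaze_b, radius=100):
    """gaze_a, gaze_b: equally long lists of (x, y) fixations on a shared screen,
    sampled at the same timestamps. Returns the fraction of samples in which the
    two gaze points fall within `radius` pixels of each other."""
    aligned = sum(math.dist(p, q) <= radius for p, q in zip(gaze_a, gaze_b))
    return aligned / len(gaze_a)

print(joint_visual_attention([(100, 100), (400, 300)], [(120, 110), (900, 40)]))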

Table 1. Overview of studies on co-located collaboration.

Moving on to the different purposes for which collaboration has been studied, Spikol et al. [33, 34] studied collaborative learning specifically in the context of Collaborative Problem Solving (CPS). They tracked the distance between hand movements and faces of group members. The recorded video streams were later coded by experts with 0 (passive), 1 (semi-active) and 2 (active) based on different combinations of head and hand positions, in order to train a machine learning classifier to predict the quality of collaboration. Recent work by Chikersal et al. [6] examines the deep structure of collaboration in dyads. They found that synchrony in facial expressions correlated with the collective intelligence of the group, but not significantly with the synchrony of members' electrodermal activity. Another work by Grover et al. [15] studied CPS in a pair-programming context based on a pilot study. They captured data from different modalities (i.e., video, audio, clickstream and screen capture) unobtrusively using Kinect. For initial training of the machine learning classifiers, experts coded the video recordings with three annotations (i.e., High, Medium and Low) wherever they found evidence of collaboration between the dyads. This evidence included pointing to the screen, grabbing the mouse from the partner and synchrony in body position. The classifier could later predict the level of collaboration.

Moreover, post-hoc coding with the help of human coders has long been an effective method to detect different indicators of collaboration. Davidsen and Ryberg [9] videotaped pairs collaborating around a touch screen on the task of measuring “the size of one meter”, where the pair tried to translate a design from graph paper to the touch screen. They found that body movements, language and gestures can help to discover different facets of collaboration. Similarly, Scherr and Hammer [26] observed videotaped groups and identified four clusters of collaborative behaviour based on both verbal and non-verbal indicators (like eye contact with peers, straight posture, clear and loud voice, etc.). Besides, some works [32, 37] considered epistemological aspects of collaboration during brainstorming, where the number of ideas generated by each member was the indicator of collaboration quality. Detecting individual attention levels in the classroom from the responses to questions (i.e., epistemological indicators) is also common [39].

In summary, collaboration indicators vary from non-verbal, verbal and physiological cues to log files obtained from shared objects like TUIs or computers, depending on the context. Table 1 gives an overview of the multimodal indicators detected. We can distinguish two types of co-located collaboration indicators, i.e., social (verbal, non-verbal and physiological) and epistemological (logs, ideas).

2.2 Feedback During Co-located Collaboration

Using these multimodal indicators, different feedback mechanisms have been developed in the past to facilitate CC. Kulyk et al. [18] designed a mechanism to give real-time feedback to participants in four-member group meetings by analyzing their speaking time and gaze behaviour. The feedback took the form of differently coloured circles representing attention from other speakers (measured by eye gaze), speaking time and attention from listeners, projected by a top-down projector onto the table in front of each participant. They performed both a quantitative and a qualitative evaluation of the effect of the feedback: the feedback was accepted as a positive measure by most group members, and its use had a positive impact on their behaviour in the form of more balanced participation and improved eye gaze. Terken and Strum [38] used a similar setting and feedback mechanism; they found that feedback on speech increased the equity of participation in the group, but, surprisingly, feedback on gaze behaviour had little effect on the interaction pattern of group members. Similarly, Madan et al. [20] used sensors to capture nodding, speech features and galvanic skin response of dyads and built a real-time group interest index, which drove real-time feedback. This feedback showed group characteristics in different modes: individual PDA feedback, personal audio feedback, haptic feedback on the shoulder and a public shared projected display. They studied these group characteristics in contexts like speed dating and brainstorming sessions.

Some simpler forms of feedback that leverage audio cues (like speaking time) during collaboration have proved effective in the past. For instance, Bachour et al. [2] performed an experiment to measure audio participation in which each group (3–4 members) performed a task around a smart table. The table gave real-time feedback during the task by lighting differently coloured LEDs for each member; the number of LEDs lit in each colour denoted the total speaking time of that member. They found that real-time feedback helped to maintain the equity of audio participation among the members. A similar approach was used by Bergstrom and Karahalios [5] with a conversation clock, in which differently coloured concentric rings represented the spoken participation of each member of a four-member group. The bars and the dots in the rings denoted the length of conversation and periods of silence, respectively.

Moving on to the epistemological aspect of collaboration, Tausch et al. [37] used an intuitive metaphorical feedback moderated by human observers during collaborative brainstorming. Three members in each group performed the task: they were supposed to discuss a certain topic, and their collaboration was measured by the number of ideas generated. A baseline for comparison was calculated as the average number of ideas generated by all members, and each group member was marked as below or above average depending on the number of ideas they generated (illustrated in the sketch below). The human observers then controlled a public shared display which showed a metaphorical garden. Each group member was represented by a flower, and the group was displayed as a tree with leaves, flowers and fruit. The growth of the flower and the tree symbolized the participation (measured by the contribution of ideas) of the individual and the group, respectively. Balanced participation was shown by a well-grown tree with leaves, fruits and flowers; if a group had unbalanced participation for a long time, lightning flashes were shown in the group garden. Another example of feedback during collaborative brainstorming was implemented by Shih et al. [32]: it supported collaborative conceptual mapping to discuss a topic and organize the ideas.
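A hypothetical sketch of this baseline rule follows (whether a count exactly equal to the baseline counts as above average is our assumption, not specified in [37]):

def mark_participation(ideas_per_member):
    """ideas_per_member: dict name -> number of ideas contributed so far."""
    baseline = sum(ideas_per_member.values()) / len(ideas_per_member)
    marks = {
        name: "above average" if count >= baseline else "below average"
        for name, count in ideas_per_member.items()
    }
    return marks, baseline

print(mark_participation({"A": 7, "B": 2, "C": 3}))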

Besides, the use of visual and haptic feedback was effective in some collaboration tasks around a TUI. Anastasiou and Ras [1] gave real-time textual and haptic feedback to groups of three members working around a TUI, where the members had to use different objects to find the desired power consumption. A questionnaire at the end showed that most participants favoured the combination of visual and haptic feedback over audio feedback. Martinez-Maldonado et al. [21] used a TUI and gave teachers real-time feedback on group performance via tablets so that they could intervene when needed and also reflect post-hoc after the task was over.

The use of external sensing devices to facilitate collaboration during meetings has also proved its worth. Kim et al. [17] used a sociometric badge which acted as a meeting mediator, capturing audio and postures during meetings of four-member groups. This badge reduced dominance and increased the equity of participation among the group members through real-time feedback on their personal mobile phones. The feedback showed a circle in the middle of the screen connected by four lines to small squares in each corner, representing the individual group members. The colour and position of the circle denoted the interactivity of the group: when participation was balanced, the circle was darker in colour and located in the centre of the screen. The thickness of the lines connecting to the circle represented the speaking time of each group member. Apart from feedback on a personal mobile display, Balaam et al. [3] used an ambient display showing a coloured-circle visualization based on the non-verbal indicator of synchrony during a collaborative calendar-planning task, and DiMicco et al. [13] used a shared group display to influence the speaking participation of each group member during a group activity.

Table 2. Overview of studies on co-located collaboration feedback.

In summary, most of these studies were conducted in controlled conditions with small groups consisting of dyads and triads only. Table 2 gives an overview of the feedback mechanisms used during co-located collaboration. Some real-time feedback mechanisms acted merely as a reflection for the group to self-regulate rather than as actionable feedback, while others used post-hoc analysis for the teachers (or facilitators) to reflect on the group activity. The mode of display varied from public displays to smartphone displays.

In a nutshell, most of the studies in the related work were conducted in controlled conditions using specialized furniture, TUIs and badges. These setups can be suitable for ad-hoc CC but are difficult to adapt to a dynamic setting, and they do not cater to the privacy and fairness of individuals. Most of these studies employ human observers as post-hoc annotators who code videos to detect traces of collaboration. To tackle these issues, we devise a human-based prototype where privacy, an in-the-wild setting and a dynamic design are at the centre of our WOz study.

3 A WOz Study: Designing the Research Prototype

Based on our analysis, we aimed to create a flexible research infrastructure that allows us to study feedback in CC, making use of different indicators and combining them in different feedback instruments and media. We followed a design-based approach focusing on a specific type of meeting and evaluated different types of indicators, human-observer interfaces, as well as feedback mechanisms. The main components of our research prototype are a defined set of indicators and sensors, a user interface for CC observation managed by human observers, and a set of feedback components.

3.1 Experimental Context

We performed the experiments during three PhD meetings with 3–7 members each, in the room shown in Fig. 1. We chose these meetings because they take place frequently and did not require us to design a task. Our main focus was to execute the study in-the-wild while preserving privacy. Thus, we used a human annotator who was present in the adjacent room, separated by a one-way transparent wall, as shown in Fig. 2. Although it is difficult to see in the picture, the wall was transparent from the annotator's side and opaque from the meeting room. A microphone was used to listen to the conversation in the other room, but the audio was not recorded. The real-time feedback was shown on a big shared public display in the meeting room (as depicted in Fig. 3), which was managed by the annotator. The real-time feedback visualization could make use of observation data from the human observer and also visualize raw data, e.g. the audio volume of the group work. When the collaborators saw the changing real-time feedback of their speaking participation on the screen, they got the impression of being tracked automatically by a microphone.

Fig. 1. Meeting room

Fig. 2. Annotator room

Fig. 3. Public display

3.2 Data Logging

For the sake of clarity in data logging, we segregated the multimodal annotation into verbal and non-verbal (i.e., gestures and postures) channels and identified different non-verbal indicators: looking at the laptop or at peers; looking down; looking at the feedback; typing on the laptop; and making different hand gestures. The verbal indicators are: occurrences, pauses, overlaps and interruptions in speech; affirmatives in speech; and asking questions. However, to ease the logging process for the human annotator, we chose to focus in a first study only on the simpler observable audio cues, namely the speaking time and turn-taking of each group member. Speech-based cues are ubiquitous in any collaboration, and non-verbal cues may be difficult for one annotator to monitor in a large group setting. The annotator used the annotation interface embedded in a Google Sheet, as shown in Fig. 4. To preserve privacy, we gave the annotator a coding sheet in which each collaborating member was given an alias letter from the English alphabet; moreover, each participating member signed a consent form. Whenever a person started speaking, the annotator pressed the corresponding button in the interface, which automatically created a cell in the Google Sheet with the start time and alias of that person. Whenever the annotator pressed another person's button, the end time of the previous person was registered in the sheet; this was possible because the buttons were coupled with JavaScript functions that perform these operations. To ensure the reliability of the coding scheme, we had a provision to include multiple annotators, but did not use it for our experiments as the task involved only simple button clicks.
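The actual prototype wired spreadsheet buttons to JavaScript functions; the following Python sketch only mirrors the underlying logging logic, in which each button press closes the previous speaker's turn and opens a new one:

from datetime import datetime

class TurnLogger:
    def __init__(self):
        self.rows = []  # each row: {"speaker", "start", "end"}

    def press(self, alias):
        now = datetime.now()
        if self.rows and self.rows[-1]["end"] is None:
            self.rows[-1]["end"] = now  # close the previous speaker's turn
        self.rows.append({"speaker": alias, "start": now, "end": None})

    def speaking_time(self, alias):
        """Cumulative speaking time (seconds) of one alias so far."""
        return sum(
            ((r["end"] or datetime.now()) - r["start"]).total_seconds()
            for r in self.rows if r["speaker"] == alias
        )

log = TurnLogger()
log.press("A"); log.press("B"); log.press("A")
print(log.speaking_time("A"))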

Fig. 4. Annotation interface

Fig. 5. Mid feedback

Fig. 6. End feedback

3.3 Modeling Participation During Collaboration

The sheet interface was connected to a chart embedded in Google Slides, which was updated in real-time whenever a value was entered by pressing a button. The other columns in the Google Sheet were automatically populated by a formula which calculates the cumulative speaking time of each member from the beginning of the meeting. Figure 5 shows the group dynamics after the first 30 minutes of a meeting as a line chart, as displayed on the big public shared display in the room during the meeting. The times shown on the horizontal axis are the plot times obtained from the end time of each member's turn; the value on the vertical axis is the total speaking time in seconds from the beginning of the meeting. Figure 6 shows the status of the line chart at the end of the meeting. Here, speaking time and turn-taking represented the participation of each group member. We also collected oral feedback from both the annotators and the collaborators during the iterative design phase.
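The spreadsheet formula itself is not reproduced here; the following sketch shows an equivalent computation of the cumulative speaking-time series plotted in Figs. 5 and 6, assuming the logged turns are available as (speaker, start, end) tuples:

from collections import defaultdict

def cumulative_series(turns):
    """turns: list of (speaker, start_s, end_s) tuples ordered by time.
    Returns speaker -> list of (plot_time, cumulative_speaking_seconds),
    where plot_time is the end time of each of that speaker's turns."""
    totals = defaultdict(float)
    series = defaultdict(list)
    for speaker, start, end in turns:
        totals[speaker] += end - start
        series[speaker].append((end, totals[speaker]))
    return dict(series)

print(cumulative_series([("A", 0, 30), ("B", 30, 90), ("A", 90, 100)]))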

3.4 Results

From our first three iterations in the PhD meetings, we developed a first prototype for analyzing turn-taking and speaking-time feedback. Our results showed that we needed a higher-level annotation interface; thus, we supported the human observers such that they only need to press a button when a new person starts talking. For the visualization on the public shared display, we experimented with different visualizations of the speaking time. Based on participants' feedback we changed the display format from the original pie chart to a line chart showing the development of the conversation over time. Examples of the feedback at different times of a meeting can be seen in Figs. 5 and 6. We can observe that speaker B, a second-year PhD student, dominates the conversation in the first 30 minutes, when it was his turn to speak about his PhD project. From that point onwards he stops participating in the meeting, as indicated by the line parallel to the horizontal axis in Fig. 6. We can also observe that, by the end of the meeting, speaker A, who is the promotor, has spoken the most and has taken turns very often to intervene during the meeting; this turn-taking is evident from the frequent changes in the shape of the line, indicated by small or large spikes.

4 Discussion

RQ1: On the multimodal indicators of collaboration quality during CC — Based on the literature study, we identified different multimodal indicators during CC in multiple contexts. They can be grouped into social (i.e., verbal, non-verbal and physiological) and epistemological (i.e., ideas and data logs) indicators. Sensors have been used in past works to detect the social indicators. For the epistemological indicators, however, human help was required, as it is difficult for sensors to automatically detect the number of ideas generated from speech by understanding its semantics.

RQ2: On the feedback during CC — Feedback during CC is either real-time (for reflection or guiding) or post-hoc (for reflection). This brings two stakeholders into the picture: the teachers (or facilitators) and the group members. We need this distinction as it will help in designing the feedback. Some works used TUIs and other electronic media like Interactive Whiteboards (IWB) and tablets during collaboration, which require a lot of preparation before a collaborative task; therefore, they are difficult to use in real-world dynamic settings. Besides, there is a trade-off between personalization and privacy: more personalized feedback shown to the whole group is less privacy-preserving. Thus, a decision should be made on the level of feedback to be shown (i.e., group, individual or both), depending on the circumstances at hand.

On the research prototype to give real-time feedback — We took a first step in building an initial prototype with the aim of facilitating real-time collaboration during meetings. We succeeded in building a click-based interface for the annotator, which also reduces memory overhead. This gives us a hybrid setup to experiment with different types of real-time feedback mechanisms during CC without building an actual automated sensor-based system. We can later use these insights to build a sensor-based or hybrid setup, in which we can build individual components in a modular fashion to track other indicators of collaboration quality and integrate them into a single dashboard.

5 Conclusions and Future Work

Collaboration being an important skill, ubiquitously present in our day-to-day activities, we look into the different collaboration indicators in various contexts in the literature. We find different types of indicators like gaze, speaking time, posture, gesture, number of ideas generated, etc. Then we look into the impact of feedback during collaboration and find that visual real-time feedback has some impact on collaboration, like improving the equity of audio participation. This feedback can range from private displays (like PDAs and mobile phones) to more public ones (like TUIs and shared displays).

Based on this overview, we took a step further and built a real-time feedback prototype for collaboration based on a privacy-preserving WOz study in-the-wild. Here, we study collaboration during co-located PhD meetings using human observers acting as a proxy for sensors. We find that the human observers could easily track ‘who spoke when and for how long’ by pressing a button.

As future work, we need to define the goal and outcome of the collaboration task and make clear in the evaluation criteria whether we measure collaboration as a process, as an outcome, or both. Then, we can focus on the feedback mechanisms for facilitating collaboration. We can also borrow insights from the mapping of multimodal data to feedback in an individual learning context [11]. The feedback can be human-based, sensor-based or a hybrid of both. We need to decide on the type (number of pointing gestures, speaking time, number of interruptions, number of eye contacts with peers, etc.), the modelling (i.e., individual, group or both) and the display of feedback (i.e., personal, public or both) based on action-based research [14], in which we take the preliminary feedback of different stakeholders like teachers (or facilitators) and group members. Our long-term goal is to do action-based research and build a sensor-based automated (or hybrid) feedback system for CC using the currently built research prototype. Here, we can include different feedback components to identify multiple indicators of collaboration and proceed towards an automated system using deep neural networks to integrate data from multiple sensors [31].