1 Introduction

Virtual Assistants (VAs), also known as cognitive assistants, have surged in popularity in the last 8 years, following the appearance of consumer products from Amazon, Microsoft, Google, and Apple. The modern vision of what a VA is and can do can be traced back to CALO (Cognitive Assistant that Learns and Organizes) [1], a DARPA project to integrate existing AI technologies into a cognitive assistant, redefined as a software agent that responds to human commands and questions in natural language and performs tasks based on that input. Siri, the commercial VA from Apple, is a spin-off of CALO.

There is a long history of using a variety of AI systems for the design of complex systems. They have been used to represent and generate design alternatives [2], to search the solution space [3], to evaluate alternatives [4], and to provide interactive visualizations for the designer [5, 6].

In most of these systems, either the AI is a tool for the human designer [7] or the human is an input to an automated tool [8]; it is rare to find cases where collaboration between human and machine is emphasized. In contrast, current research in human-machine interaction suggests that a more collaborative approach may increase performance [9].

In this context, we started developing Daphne [10] three years ago, with the purpose of bringing the usability and cognitive offloading abilities of general-purpose VAs to the early design of complex systems, specifically Earth observation satellites. Daphne is centered on improving design space exploration, a vital part of the early design of complex systems.

Feedback gathered during exit interviews in a prior study with practitioners emphasized the importance of being able to justify the decisions resulting from such studies [10]. This makes the slightly negative trend in learning we observed in that experiment worrisome, even though the method we used to measure learning there may not capture some kinds of knowledge a test subject might gain while using Daphne. Although we showed that performance does increase when using Daphne at its full capability, that matters little if designers cannot justify Daphne's outputs to stakeholders. This realization prompted the development of explanation strategies for Daphne [11] and the study described in this paper.

In this study, we evaluate users' responses to the different parts of Daphne, and we evaluate how learning and understanding of the problem change with the degree of Daphne usage, in the hope of drawing conclusions on how to design VAs for engineering design problems that improve both performance and learning.

We define learning in the context of design space exploration as improving one's understanding of the structure of the design space, e.g., the trade-offs between different design criteria, the sensitivities of design criteria to design decisions, or the existence of families of similar designs with similar performance. Currently, there are no agreed-upon metrics to measure human learning in design. Bang and Selva, inspired by Bloom's taxonomy of learning [12], proposed that such metrics should encompass different cognitive processes such as remembering information, understanding concepts, analyzing the information, and creating new concepts [13].

2 Daphne Architecture

Daphne's main components and data flow are described below. Daphne has a web frontend that provides access to its capabilities and acts as the main User Interface (UI) for the system. A screenshot of this interface can be seen in Fig. 1. There are three main areas in the interface: the left area has a menu with all the available functions; the center area contains the design space plot, which allows for design space exploration, and the different tools available to the user, including a Design Builder; and the right area shows the chat history between Daphne and the user.

Fig. 1 Daphne's interface

The frontend transmits user requests to the Daphne Brain, a web server that forwards each request to the correct service. Requests can be natural language requests (text or voice) or classical interactions such as mouse clicks, hovering, or drag-and-drop.

Each request is processed by the Brain and sent to one of several roles, which are small programs in charge of handling groups of similar requests. All roles are also capable of being proactive and sending information to the user without a prompt, as described in [14].

There are five roles in Daphne. The first is the Engineer, which answers questions as a domain expert and handles the evaluation of designs using models. Both functions of this role are supported by the VASSAR backend [15], a rule-based system for evaluating the performance and cost of Earth observation missions. The second is the Analyst, whose job is to mine the dataset for knowledge about the shared features of designs in regions of interest in the objective space. This is supported by the iFEED backend [16], which searches for the if-then rules that best explain a user-defined design region using a variety of rule mining algorithms. The third is the Explorer, which controls a background search for better designs. The search is performed with the algorithm described in [17], an extension of a multi-objective genetic algorithm that incorporates domain knowledge to make the search more efficient. The fourth is the Historian, which takes questions about past and currently operating missions and answers them based on the data in the CEOS database. The fifth and final one, the Critic, takes a design as input and gives the user feedback on that design. This feedback comes from all the other roles, but it is synthesized into a few sentences from each role.
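As a minimal illustration of this request-routing pattern, consider the sketch below. It is not Daphne's actual implementation: only the role names come from the description above, and the handler signatures and responses are invented for illustration.

```python
# Illustrative sketch of the Brain's request-routing pattern (not Daphne's real code).
from typing import Callable, Dict


def engineer(request: str) -> str:
    # Stand-in for the role backed by VASSAR (domain questions, design evaluation).
    return f"[Engineer] would evaluate or explain: {request}"


def analyst(request: str) -> str:
    # Stand-in for the role backed by iFEED (rule mining over design regions).
    return f"[Analyst] would mine features for: {request}"


# The Explorer, Historian, and Critic would be registered the same way.
ROLES: Dict[str, Callable[[str], str]] = {
    "engineer": engineer,
    "analyst": analyst,
}


def brain_dispatch(role_name: str, request: str) -> str:
    """Forward an already-classified request to the role in charge of it."""
    handler = ROLES.get(role_name)
    if handler is None:
        return "Sorry, no role can handle that request yet."
    return handler(request)


print(brain_dispatch("analyst", "What do low-cost, high-science designs share?"))
```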

3 Experimental Design

To study the effects of using Daphne on learning and performance in the Earth observation task, we conducted a study at Texas A&M University with a diverse STEM student population. To address concerns about the measure of learning used in [10], we used a new, more holistic measure of learning in this study, described in the Dependent Variables section. We expected that the more users interacted with Daphne, the more they would learn. In addition, while it was not the focus of the study, we explored whether certain roles or functions of Daphne help users more than others with respect to both learning and performance.

Thus, we set the following hypotheses for the experiment:

  • H1: There is a positive correlation between degree of Daphne usage and learning about the problem.

  • H2: There is a positive correlation between degree of Daphne usage and performance on the design task.

  • H3: There is a difference in the task performance when using Daphne as a Peer vs Daphne as an Assistant.

  • H4: There is a difference in learning when using Daphne as a Peer vs Daphne as an Assistant.

4 Demographics

We recruited N = 26 Texas A&M students from STEM degree programs. Recruitment was done through mass emails on the university network, through social media posts, and as part of a capstone design class for Aerospace Engineering students. All participants were promised a $15 gift card for a major online retailer as a token of appreciation for participating in the experiment. The main demographics are summarized below:

  • Age range: 20–33 years old

  • Gender: 20 identified as Male, 6 as Female, 0 as Others

  • Current degree: 14 were BS students, 7 were MS students, 4 were PhD students, and 1 was a postdoctoral researcher

  • Major: 12 were studying Aerospace Engineering, while the rest came from various STEM disciplines

  • Prior Experience in Satellite Design: 7 subjects had previous experience, while 19 did not

5 Experiment Protocol and Conditions

After signing the relevant IRB forms, each test subject sat down at a computer we provided. The subject then went through a tutorial explaining the protocol described here and how to use all the functions of Daphne. The tutorial was interactive and thus had no time limit. This made the duration of the experiment variable, but we noticed in past experiments that limiting the tutorial time hurt performance. Once the tutorial was done, each test subject had to solve the design task under two different conditions, Peer vs Assistant. The features of Daphne available in each condition are detailed in Table 1 below. Participants were given 15 min to solve the task for each condition. After each task, the test subject was asked to complete a learning test, which was not time limited. At the end of the experiment, a semi-structured exit interview was conducted in which subjects were asked to give their opinion on the experiment, the tool, and the task, to gather feedback for improving the system and the experiment.

Table 1 Features available in each condition

Each subject performed two tasks (one per condition). The experimental design is between-subjects for H1 and H2 and within-subjects for H3 and H4. Each participant solved a problem of similar difficulty under each condition, and the order in which the conditions were presented was randomized to mitigate learning effects. All interactions with Daphne were recorded, from questions asked through the natural language interface to button clicks and hovering.

6 Task Details

The task given to the test subjects was the same as in [10] to allow for comparisons. Subjects were asked to design a satellite system to monitor soil moisture. They were given a set of 5 candidate orbits (e.g., different altitudes and inclinations) and a set of 5 candidate instruments (e.g., different types of infrared and microwave sensors), and were asked to assign instruments to orbits with no constraints: every instrument can be in any subset of the orbits, including none or all of them. The VASSAR backend [15] was used to assess the scientific value and cost of each design. Specifically, test subjects were tasked with finding a set of designs that push the boundary of the cost-science trade-off (more formally, the Pareto front) for costs between $800M and $4000M.
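To make the structure of the task concrete, the sketch below shows one way to reduce each evaluated design to its two objectives and to filter a set of designs down to its non-dominated (Pareto) subset. The (science, cost) values are placeholders; in the study they were computed by the VASSAR backend.

```python
from typing import List, Tuple

# Each evaluated design reduces to a (science, cost) pair:
# science is maximized, cost (in $M) is minimized.
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if design a is at least as good as b on both objectives
    and strictly better on at least one."""
    sci_a, cost_a = a
    sci_b, cost_b = b
    return (sci_a >= sci_b and cost_a <= cost_b) and (sci_a > sci_b or cost_a < cost_b)


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the non-dominated (science, cost) points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Placeholder evaluations for four hypothetical designs:
designs = [(0.20, 900.0), (0.35, 1500.0), (0.30, 2000.0), (0.55, 3800.0)]
print(pareto_front(designs))  # (0.30, 2000.0) is dominated by (0.35, 1500.0)
```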

7 Dependent Variables

  1. Performance: The true Pareto front for this design task is not known, so to measure performance we computed an approximation of the optimal set by running a multi-objective genetic algorithm [17] for 10,000 evaluations. With this reference set, we defined performance on the task as the distance between the user's Pareto front and the "reference" front found by the genetic algorithm. This distance was measured with the Hyper-Volume (HV) metric, a well-known metric in multi-objective optimization. We normalized the metric so that it equals 0 if the subject's HV is the same as the starting one and 1 if it matches the HV of the reference set; a sketch of this normalization appears after this list.

  2. Learning: One of the main limitations of past experiments was the metric used to measure learning. For this paper, we build on a study by Bang and Selva [13] on measures of learning for tradespace exploration problems. Their conclusion is that a learning test must target different cognitive processes such as remembering, understanding, analyzing, and creating. To do this, we defined three tests. The first consists of 12 identification questions: for a design chosen from the dataset, the user was asked whether that design is close to the Pareto front. The second test also has 12 questions: for each one, the subject was asked to identify the higher-science design of two designs with similar cost. For both tests, subjects were also asked to rate their confidence in their answers. Finally, we asked each subject three subjective questions on learning, to measure their perception of their own learning.

  3. Usability: We administered a standard usability survey after each task: the System Usability Scale (SUS) [18]. It consists of 10 Likert items and has been validated in a large number of software usability studies, including studies of intelligent systems.

  4. Trust: We administered a standard trust-in-automation survey after each task: Jian's Trust in Automated Systems scale [19]. It consists of 12 Likert items. This survey has been validated in a multitude of studies on automation, including VAs.
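As a sketch of the performance normalization described in item 1 above, the computation could look as follows. This is illustrative only: the hand-rolled bi-objective hypervolume, the reference point, and the example data are assumptions, not the exact implementation used in the study.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (f1, f2), both to be minimized


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Hypervolume of a bi-objective minimization set w.r.t. a reference point.
    For this task, a (science, cost) design maps to (-science, cost) so that
    both objectives are minimized."""
    # Keep only points strictly better than the reference in both objectives.
    pts = [p for p in points if p[0] < ref[0] and p[1] < ref[1]]
    # Keep only non-dominated points.
    pts = [p for p in pts
           if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in pts)]
    pts.sort()  # ascending f1, so f2 descends among non-dominated points
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # add one horizontal slab
        prev_f2 = f2
    return hv


def normalized_performance(user: List[Point], start: List[Point],
                           reference: List[Point], ref_point: Point) -> float:
    """0 if the user's HV equals the starting dataset's HV,
    1 if it reaches the HV of the GA reference set (clamped to [0, 1])."""
    hv_user = hypervolume_2d(user, ref_point)
    hv_start = hypervolume_2d(start, ref_point)
    hv_ref = hypervolume_2d(reference, ref_point)
    return max(0.0, min(1.0, (hv_user - hv_start) / (hv_ref - hv_start)))


# Tiny example with made-up (-science, cost) points and an assumed reference point:
ref_point = (0.0, 5000.0)
start = [(-0.20, 1200.0), (-0.30, 2500.0)]
user = start + [(-0.40, 1800.0)]
reference = user + [(-0.55, 2200.0), (-0.25, 900.0)]
print(normalized_performance(user, start, reference, ref_point))  # ~0.43
```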

8 Results

To test H1 and H2, we collected several usage statistics from each test subject during the 15 min they spent performing the task: the number of questions asked to Daphne, the number of designs evaluated, and the number of interactions. More detailed information was also recorded, such as the number of interactions with each role (Critic, Engineer, Analyst, Historian) and the number of designs found by the Explorer vs by the subject. The dependent variable data were then separated into two groups based on usage (more usage vs less usage) and tested for a difference in means. A selection of the results is plotted below. For the sake of brevity, Fig. 2 shows only the results of interest for H1, while Fig. 3 shows only the results of interest for H2. Most variables had no correlation or trend and are not plotted. Results from 8 test subjects were omitted because the Explorer did not work for those users, so their scores could not be fairly compared to the others'.
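A minimal sketch of this split-and-compare analysis follows. The median split, the Welch variant of the t-test, and the data below are all assumptions for illustration; the study does not specify these details.

```python
import numpy as np
from scipy import stats

# Hypothetical data for the 18 analyzed subjects: one usage statistic
# (e.g., number of questions asked to Daphne) and their learning scores.
questions = np.array([2, 5, 7, 3, 9, 12, 4, 8, 15, 6, 11, 1, 10, 14, 5, 13, 7, 9])
learning = np.array([0.40, 0.50, 0.60, 0.45, 0.70, 0.75, 0.50, 0.65, 0.80,
                     0.55, 0.70, 0.35, 0.60, 0.85, 0.50, 0.70, 0.60, 0.65])

# Split subjects into "more usage" vs "less usage" groups at the median.
median = np.median(questions)
more = learning[questions > median]
less = learning[questions <= median]

# Welch's t-test for a difference in mean learning between the two groups.
t_stat, p_value = stats.ttest_ind(more, less, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```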

Fig. 2 Correlations between usage (#questions, #interactions, #designs evaluated) of the Daphne VA and learning

Fig. 3 Correlations between usage and performance

The plots in Fig. 2 show a trend of increased learning with increased usage, but the p-values of the t-tests are 0.14, 0.15, and 0.13 for #questions, #designs, and #interactions, respectively, none of which is significant.

The plots in Fig. 3 also show a trend of increased performance with increased usage, albeit weaker than the trend for learning. The p-values of the t-tests are 0.22, 0.42, and 0.76, respectively.

Finally, we also found interesting relationships among perceived usability (U), perceived trust (T), learning (L), and performance (P), which are detailed in Fig. 4 below. In this case, the p-values for the null hypothesis of zero slope (i.e., no correlation) are 0.12 (L vs T), 0.00035 (L vs U), 0.07 (P vs T), and 0.57 (P vs U).
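A sketch of the zero-slope test behind these p-values is shown below. The scores are invented for illustration; for a single predictor, the slope test is equivalent to a Pearson correlation test.

```python
import numpy as np
from scipy import stats

# Invented per-subject scores: SUS usability (0-100) vs learning-test score (0-1).
usability = np.array([55, 60, 62, 68, 70, 72, 75, 78, 80, 82, 85, 88, 90, 92])
learning = np.array([0.35, 0.40, 0.45, 0.50, 0.48, 0.55, 0.60, 0.58, 0.65,
                     0.70, 0.68, 0.75, 0.80, 0.78])

# Test the null hypothesis of zero slope (equivalently, no linear correlation).
result = stats.linregress(usability, learning)
print(f"slope = {result.slope:.4f}, r = {result.rvalue:.2f}, p = {result.pvalue:.4g}")
```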

Fig. 4 Correlations between trust, usability, learning, and performance

To test H3 and H4, we compared the task performance and learning scores of users when they used Daphne as a Peer vs as an Assistant. The distributions of results are shown in Fig. 5. They show no appreciable difference, and the p-values of the t-tests confirm it, with values of 0.20 for performance and 0.75 for learning.
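Since each subject experienced both conditions, a paired comparison is the natural within-subjects test. The sketch below assumes a paired t-test (the text above only says "t-test", so the pairing and the scores are assumptions).

```python
import numpy as np
from scipy import stats

# Invented paired performance scores, one pair per subject:
# (Assistant condition, Peer condition).
assistant = np.array([0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.44, 0.58])
peer = np.array([0.45, 0.50, 0.40, 0.65, 0.43, 0.55, 0.48, 0.54])

# Paired t-test: is the mean within-subject difference zero?
t_stat, p_value = stats.ttest_rel(peer, assistant)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```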

Fig. 5 Distributions of task performance and learning for each condition

9 Discussion

The results support H1 and H2, but not H3 and H4, although the support for the first two hypotheses is weak, especially for H2. The trend observed in most variables plotted for both H1 and H2 is that the more a user interacts with Daphne, the higher the lower bound becomes for both learning and performance. Some users are able to get great results with few interactions with the system, but raising this lower bound simply by using the system more within the same amount of time is a result worth pointing out. If further studies confirm this trend, we have an actionable way of fostering good learning and performance. This trend is also what one would expect: the more someone uses a system, the more proficient they become with it, so they also get more out of it.

We observe a strong correlation between the perceived usability of Daphne and the learning test score. It makes sense that the more a user learns, the more usable they find Daphne, and vice versa.

Similarly, we observe that trust is correlated with performance. Although the relationship is not as strong as the one between usability and learning, it again appears that performing well with the system leads to higher trust in it. A follow-up causal study could confirm these two findings.

The results suggest that the tools available to the designer under each condition improve learning and performance to a similar degree. A future study should look at whether combining the two conditions (Peer + Assistant) results in higher learning or performance than either alone.

In terms of time spent on different functionalities, users spent the most time on functions in the graphical UI, including both design space exploration and design creation. These were followed in usage by backend roles such as the Analyst role and the Peer features. The Historian role was not used at all, and the Engineer role went almost unused by most users.

Because the results lack statistical significance, we also studied the qualitative feedback from the users' exit interviews. One consistent piece of feedback, in both the interviews and the usage metrics, is that subjects preferred roles such as the Analyst (data mining) and the Critic over roles such as the Engineer and the Historian. Some subjects mentioned that these roles are more efficient to use in time-constrained situations such as the one in this experiment. The same effect was not seen when analyzing data from a previous experiment with subject matter experts; in fact, subjects familiar with satellite design mentioned that the Engineer tools were more helpful to them. A future experiment comparing the usage patterns of expert vs non-expert populations is needed to learn more about the trends we are seeing here, to tailor Daphne to the users who will end up using it, and to learn whether it is appropriate to use students as subjects in further experiments.

10 Conclusion

This paper described an experiment aimed at improving our understanding of the relations between key parameters in human-machine collaborative design space exploration. Specifically, we measured how using a VA may improve learning and performance in design space exploration. The main takeaways from this experiment are that increased interaction is linked to increased performance and learning, and that trust, usability, performance, and learning tend to go hand in hand. We also found that STEM students (rather than practicing designers) prefer features that help them synthesize large amounts of data, as they spent more time using those features than others.

This study is not without limitations. Most results are not statistically significant, so more experiments are needed to confirm or refute the trends we have observed. Another important limitation is the time allotted for each task, which may be too short to both perform well on the task and learn meaningful facts about it. This means the findings and recommendations in this paper may be proven wrong in the future and should be tested independently. Further research is also warranted to understand the differences between non-experts (students) and expert practitioners in their usage of VAs and their various roles for design space exploration.