
1 Introduction

A proposed definition of adaptive instructional systems (AISs) is as follows: “computer-based systems that guide learning experiences by tailoring instruction and recommendations based on the goals, needs, and preferences of each learner in the context of domain learning objectives” [1]. The purpose of an AIS is to optimize learning for a learner (or team of learners), based on some learning objective. This can be mastering a skill, gaining a competency, or learning to achieve a task. Traditionally, AISs are built to instruct human students. However, human students are not the only learners that might benefit from an AIS. In this paper we explore AISs for training computational learners.

Reinforcement learning (RL) is a class of machine learning techniques that enables intelligent (virtual) agents to learn behaviors through interaction with an environment [2, 3]. Rather than being explicitly taught, an RL agent learns from its experiences through trial-and-error. In the last decade, due to algorithmic innovations and the increased availability of computational power, many advances have been made in the field that allow agents to solve increasingly complex problems in intricate environments [4]. As a consequence, RL algorithms are beginning to find their way into applied research domains, where agents can be employed for applications such as simulation-based training or analysis, taking on the role of intelligent role-players or entities.

Meanwhile, there is no one-size-fits-all solution, and it remains a challenge to exploit new RL algorithms and develop agents capable of learning and performing specific tasks in a target application domain. While an RL agent may have the learning capability (i.e., the algorithm) to learn complex tasks, it still needs to be trained and made fit-for-purpose for the target domain. Such operationalization of RL agents involves, for instance: (1) setting up a training goal through the design of tasks for teaching skills and knowledge; (2) forming an instructional strategy, viz. a training design or curriculum to train the learner on those tasks; and (3) providing a learning environment that is representative of the target operational environment. Although there is a wealth of literature that addresses algorithmic improvements for the learning mechanisms, providing experimental results in various (often low-fidelity, toy-like) domains, there is less guidance available on general design strategies for exploiting, training, and integrating RL agents in target domains.

In this paper, we advocate the use of an adaptive instructional system (AIS) as a conceptual framework to design and teach RL agents. We will show that, although AISs were originally intended to optimize human training, many analogies can be drawn between (1) the functions provided by an AIS and (2) the functions required to train RL agents. In the mapping between the functions of an AIS and RL, we abstract away from the specific algorithmic technology (cf. the learner) and focus on the training design and instructional process, namely teaching a learner to operate in a task domain. Similar to how the fields of cognitive and neuroscience have inspired and continue to influence RL research (leading to improved learning algorithms), instructional sciences may inspire and influence training methods for RL agents (leading to optimized learning through instruction). The aim of this paper is thus to promote synergies and exchange lessons learned between practitioners of AIS research and practitioners of RL research.

We begin this paper by presenting our motivation (Sect. 2). Next, we compare and analyze the training concepts relating to a human learner and an agent learner using examples from RL literature (Sect. 3). We continue by zooming in on the central concept, namely the training system realized by means of an AIS, where the actual learning takes place (Sect. 4). Here, we discuss how an AIS may be applied to teach an RL agent. Following this discussion, we illustrate the views presented in Sects. 3 and 4 by outlining a system that operationalizes learning agents within a concrete target domain, namely a military training simulation system in the air combat domain (Sect. 5). We conclude the paper with a general discussion of our analysis and opportunities for future directions (Sect. 6).

2 Motivation

In this paper we advocate the use of AIS theory as an instrument to operationalize RL agents. Here, operationalize means the process of taking a general-purpose (untrained) RL algorithm and training it such that it can operate and perform in a desired target or task environment. It addresses the question of how to make an RL agent ‘fit-for-purpose’. Our four motivations for applying AIS theory to RL agents are the following.

1. Reaching a conceptual development framework for RL. An AIS provides a conceptual framework for computer-based teaching of learners. It offers a structured method for addressing key design questions, such as how to define the task, skills or knowledge to teach; how to define the expected standard of performance; how to measure and evaluate the learner against this standard; and what instructional strategies to apply. Whereas RL research usually has a prime focus on improving learning algorithms, an AIS focuses on optimizing the learning process through training in a specific task domain. The latter receives less attention in RL research but is a crucial step to successfully integrate RL technology in industrial applications. AIS theory can be used to fill this gap, both from a theoretical and a technical viewpoint.

2. Technology abstraction for RL research. The concept of an AIS provides a clear separation between the algorithmic learning technology, in other words a learner, and the task domain and environment it is applied to. By abstracting the learner from the training process and the environment using an AIS, the application domain can be designed independently, regardless of the underlying learning technology. New RL algorithms can then more easily be integrated or exchanged, with less impact on the system as a whole. Although specific technical solutions exist for common interfaces between RL algorithms and environments (such as OpenAI Gym [5]), we aim for a more holistic view on this abstraction.

3. Working towards a mixed-learner AIS. In a shared human-agent training environment, an AIS has the potential to fulfil multiple purposes. For instance, consider a simulation-based training environment where an AIS is used to train humans operating in teams. A common approach is to employ agents as simulated role players that can replace human roles. In such an environment, a similar AIS for training humans can potentially also be used for training agent role players. From an implementation perspective, there are many commonalities for developing such an AIS, in terms of a domain model, training needs and goals, performance measurements and evaluations, and a learning environment.

4. Identifying cross-field research opportunities. The concept of AISs can promote research synergies between machine learning and instructional science, similar to the bi-directional synergy between machine learning and cognitive and neuroscience [6]. On the one hand, theories on instructional strategies can be used to optimize computational learners (e.g. consider transfer learning techniques [7]). On the other hand, simulations of instructional strategies applied to computational learners can lead to insights on optimization techniques for human learning (an example for part-task training is described in [8]). In this context, an AIS for RL agents can act as an experimentation testbed to explore, test and validate instructional strategies for humans. A key challenge here is the construction of a computational learner that can act as a representative model for human learning.

3 Comparative Analysis of Training Concepts

To support the idea of applying concepts originating from human instructional systems to agent learners, a broad context of training is required. In this section we provide a comparative analysis of training concepts using a basic model of a training process, as shown in Fig. 1. The model focuses on teaching a learner (human or agent) some task using a training system, thereby ‘updating’ the learner such that it can be deployed to an operational environment.

In the remainder of this section we describe the different concepts from Fig. 1 using analogies between human and agent learning and training, based on literature on machine learning. Section 3.1 presents the learner as the subject of training. Section 3.2 discusses the needs analysis by which tasks can be defined that form the training objective for the training system. Section 3.3 describes the training system in terms of instructional strategies. Finally, Sect. 3.4 addresses how the learner, post-training, can be deployed to the operational environment.

Fig. 1. Basic model of a training process

3.1 The Learner, Its Learning Mechanism and Priors

The learner is the subject that undergoes training. In the remainder of this paper, the term agent learner or human learner is used when a distinction is called for; otherwise the term learner is used.

In order to qualify for training, a learner must meet two requirements. First, it must embody a learning mechanism that is capable of learning the task. Second, training may demand a minimum set of priors (previously acquired skills and knowledge) from the learner. We discuss the learning mechanism and the priors below.

Learning Mechanism.

For a human learner, the learning mechanism is the human brain (i.e. the ‘hardware’). In contrast, agents and their learning mechanisms (i.e. algorithms) are invented and engineered. Thus, when the goal is to build specific task-capable agents, choosing or developing the learning algorithm becomes part of the design process. Current mainstream machine learning algorithms are limited to narrow AI, implying they are limited to learning a specific task in a scoped context (the task domain). There is no one-size-fits-all algorithm, and different types of algorithms are specialized for different types of capabilities. Therefore, it is essential to align the learner’s algorithm with the nature of the task to be learned. This requirement is discussed further in Sect. 3.2.

Priors.

For a human learner, the priors are basic abilities, skills and knowledge acquired through earlier life experiences or prior education and training (i.e., the ‘software’). Two types of priors can be distinguished in agent learners, namely pre-trained models and encoded knowledge (discussed below). They differ in how they are established: either implicitly obtained from earlier training, or explicitly encoded by a designer:

  • Pre-trained models. These are the result of a previous training or learning experience on a different but related task. The use of such models for a new training process is known as transfer learning. This can significantly increase training efficiency for a target task, requiring less data (training samples) and computational time. Pre-trained models are used extensively in deep learning for vision tasks [9] or language tasks [10]. Widespread use in those domains is possible because of their general nature (dealing with images or language). In contrast, pre-trained models for behavioral (RL) tasks are far less common because of the specificity of the tasks and the domain in which they are applied. Still, when transferring between highly comparable tasks, transfer learning can be applied successfully [7]; a minimal sketch of this form of transfer is given after this list. Developing priors through pre-training can be seen as an instructional strategy, part of the training system. This cycle of learner transfer is illustrated in Fig. 1.

  • Encoded knowledge. This form of explicit priors represents symbolic knowledge that is encoded directly into the learner, for instance by a subject matter expert (SME). It can represent inference or decision-making rules, beliefs or goals. Certain learning algorithms depend on the existence of encoded knowledge prior to any training. For instance, in h-DQN [12], a hierarchical deep reinforcement learning algorithm, a human specifies a set of goals that represent sub-tasks in complex task domains; the agent then learns how to achieve individual goals and when to pursue them in order to achieve a higher-level goal. Similarly, in dynamic scripting [11], a rule-based machine learning algorithm, a human must author a rule base of possible if-then rules, after which the learning algorithm optimizes scripts of behaviour rules through trial-and-error.
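As a minimal illustration of the pre-trained model type of prior, the sketch below loads the weights of a hypothetical policy network trained on a related source task and fine-tunes only its final layer on the target task. The network architecture, file name and hyperparameters are illustrative assumptions, not part of any specific system discussed in this paper.

```python
import torch
import torch.nn as nn

# Small policy network; the architecture is a placeholder.
policy = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),          # action logits for the target task
)

# Prior: weights obtained from training on a related source task (hypothetical file).
source_state = torch.load("source_task_policy.pt")
policy.load_state_dict(source_state, strict=False)   # tolerate a mismatching output head

# Freeze the earlier, more general layers; only the task-specific head keeps learning.
for param in list(policy.parameters())[:-2]:
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in policy.parameters() if p.requires_grad), lr=1e-4)
```

During subsequent RL training on the target task, the frozen layers act as the prior, while the remaining parameters continue to be updated by the learning algorithm.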

3.2 Needs Analysis and Training Objective

In the design of a training system, a needs analysis is performed to define the training objective. In human training, methods such as job, task or competency analyses are performed to gain insight into the task activities and the required knowledge, skills and abilities (KSA). Such an analysis is typically part of the first phase of the Analyze, Design, Develop, Implement, and Evaluate (ADDIE) process for instructional design [13].

In RL, a needs analysis is also important to gain insight into the requirements for the learner in terms of cognitive abilities. These abilities must be aligned with the capabilities of its learning mechanism. In other words, can it be expected that the learner will be capable of learning the task; does it have the required ‘hardware’? If basic required abilities are lacking, the learner is not suited for the training in question.

A classic example from the literature is the deep Q-network (DQN) algorithm, which has shown impressive results on many task domains but fails on tasks that require even simple planning over sequences of sub-tasks [4]. Different algorithms are equipped with different architectural features to support certain abilities, as seen for instance for capabilities of visual-spatial processing [14], long-term planning [2] or short-term [15] or long-term memory [16]. It can be difficult to quantify and relate human-like cognitive abilities to abilities that can be expressed by machine learning algorithms. Moreover, abilities are not always known in advance, and limitations may only be found through experimentation. Still, such an analysis is crucial to build confidence that a specific algorithm is capable of learning the task, and to guide the selection or dismissal of candidate algorithms.

3.3 Training System

The training system offers the learner a training environment in which it can learn the task. Instructional strategies are designed and implemented with the goal of optimizing the learning process within the learner with respect to task performance. Three methods are described below: adaptive instruction, curriculum learning and social learning.

Adaptive Instruction.

For human learning, adaptive instruction can be used to keep the learner within the ‘zone of proximal development’ by balancing the task challenge against the competence level of the learner [17]. To keep the learner engaged, the task should be neither too easy nor too complicated, as this may lead to boredom or anxiety, respectively. For agent learners a similar balance is required, though not for the purpose of keeping learners motivated, but for functional reasons of learning efficiency [18].

Adaptive instruction generally centers around two instructional strategies: (1) adjusting the level of support or (2) changing the nature of the content [1]. The former is an explicit form of aiding by a tutor (scaffolding), such as giving hints or advice. The latter is an implicit form of aiding through adaptation of the task environment, such as changing the difficulty level. These two strategies are described next for agent learners:

1. Scaffolding. Scaffolding in RL can be associated with reward shaping as a technique to guide the learning process using feedback signals (rewards) that indicate positive or negative trends towards the training objective [19]. It is an indirect approach to incorporating task knowledge into the learner through inference. For sub-symbolic algorithms (e.g. neural networks) that do not allow prior encoding of task knowledge, feedback signals are the only medium through which an agent can infer any knowledge about the task. When prior task knowledge (the ‘rules of the game’) cannot be communicated to the learner, this generally leads to slow and ‘sample-inefficient’ learning. A minimal sketch of reward shaping is given after this list.

2. Environment adaptation. Adaptation of the learning environment or scenario can be used to change the difficulty level of the task to optimize learning [20]. Part of adaptive design is to identify so-called complexity factors to be used as control dials for adaptations. During training, such adaptations can guide the learner from simple to more complex environments. In RL, environment adaptation serves an additional purpose, namely to optimize the coverage of the possible observational input of the learner, without specifically changing complexity. This allows the learner to generalize a learned behavior to all possible contexts. Through smart and fast scenario adaptation, this can significantly increase learning efficiency. This feature is less relevant for human learners, as they are more proficient at this kind of generalization.
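To make the scaffolding analogy concrete, the sketch below implements potential-based reward shaping as an environment wrapper, written against the classic OpenAI Gym API (pre-0.26 step signature). The potential function passed in (e.g. a negative distance to the goal) and the discount factor are assumptions for illustration.

```python
import gym


class PotentialShapingWrapper(gym.Wrapper):
    """Adds a shaping term F(s, s') = gamma * phi(s') - phi(s) to the task reward."""

    def __init__(self, env, potential_fn, gamma=0.99):
        super().__init__(env)
        self.potential_fn = potential_fn
        self.gamma = gamma
        self._last_potential = None

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self._last_potential = self.potential_fn(obs)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        potential = self.potential_fn(obs)
        # Extra feedback signal indicating progress towards the training objective.
        reward += self.gamma * potential - self._last_potential
        self._last_potential = potential
        return obs, reward, done, info
```

Potential-based shaping is a common choice because it guides the learning process without changing which policies are optimal for the original task.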

Curriculum Learning.

Curriculum learning is a method of training based on the idea of gradually increasing task complexity. The idea of curriculum learning for machines dates back to 1993 [21] and is currently a well-researched area in RL [22]. In curriculum learning the goal is to strategically decompose a task (a whole-task) into so-called part-tasks. Such decomposition could be based on e.g. different sub-tasks, certain skill-sets or options for scaling the complexity of the environment.

Presenting part-tasks to a learner in a well-designed curriculum can significantly speed up learning. However, a badly designed curriculum can also lead to worse performance compared to training solely on the whole-task. Depending on the implementation, curriculum learning can be seen as a form of adaptive instruction when used within a single training session, or as consecutive trainings with different task objectives (indicated by the learner loop in Fig. 1).
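One simple way to operationalize a curriculum is a scheduler that promotes the learner to the next part-task once its recent performance exceeds a threshold. The part-task names, window size and promotion threshold below are purely illustrative assumptions.

```python
from collections import deque


class CurriculumScheduler:
    """Moves the learner through part-tasks ordered from simple to complex."""

    def __init__(self, part_tasks, promotion_score=0.8, window=50):
        self.part_tasks = part_tasks
        self.promotion_score = promotion_score
        self.recent_scores = deque(maxlen=window)
        self.stage = 0

    @property
    def current_task(self):
        return self.part_tasks[self.stage]

    def report_episode(self, score):
        """Record an episode score; promote when average recent performance is high enough."""
        self.recent_scores.append(score)
        full_window = len(self.recent_scores) == self.recent_scores.maxlen
        mean_score = sum(self.recent_scores) / len(self.recent_scores)
        if full_window and mean_score >= self.promotion_score and self.stage < len(self.part_tasks) - 1:
            self.stage += 1
            self.recent_scores.clear()


# Illustrative decomposition of a whole-task into part-tasks.
scheduler = CurriculumScheduler(["hold_heading", "join_formation", "keep_formation"])
```

A badly chosen decomposition or threshold can still hurt final performance, which is why curriculum design remains part of the instructional strategy rather than of the learning algorithm.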

Social Learning.

Social learning is a form of learning where new knowledge and skills are acquired by imitating or observing others. In RL, comparable techniques have been used successfully. In imitation learning, an agent learns to imitate behavior based on demonstrations given by another agent [23]. In observational learning, an agent learns from observing another agent performing a task in a shared environment [24]. The difference is subtle. In the former, an agent purely imitates behavior, regardless of the correctness of the demonstrated behavior with respect to the task. In the latter, an agent learns the underlying task by observing the actions of the teacher and their effects on the environment, i.e., it learns both good and bad behavior in achieving the task.

Both approaches can be performed in an online or offline fashion. In an online setting, the teacher (e.g. a human or another agent) is present together with the learner in a shared training environment. In an offline setting, learning is based on existing or recorded datasets and no interactive environment is available. This is also termed offline RL [25] or data-driven RL [26]. Offline learning can be seen as building priors before training and is a favored approach for bootstrapping the learning process of the learner.
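As a minimal sketch of the offline variant, the code below performs behaviour cloning: a policy network is fit by supervised learning to recorded (observation, action) pairs from demonstrations and can then serve as a prior for subsequent online training. The file name, data layout and network size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Recorded demonstrations (hypothetical file): observations and the teacher's discrete actions.
demos = torch.load("recorded_demonstrations.pt")
observations = demos["observations"]            # shape: (num_samples, obs_dim)
actions = demos["actions"].long()               # shape: (num_samples,)

policy = nn.Sequential(
    nn.Linear(observations.shape[1], 64), nn.ReLU(),
    nn.Linear(64, int(actions.max()) + 1),      # one logit per demonstrated action
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    logits = policy(observations)
    loss = loss_fn(logits, actions)             # imitate the demonstrated actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that pure behaviour cloning imitates the teacher regardless of whether the demonstrated behaviour was correct for the task, matching the imitation learning variant described above.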

3.4 Operational Deployment

After successful training, the learner is qualified for deployment in the operational environment. Relevant aspects that come into play are the validation of the learner, the transfer from the learning to the operational environment, and continuous learning.

Validation of the Learner.

A validation process judges the performance of the learner on the task in the operational environment. For human learners, validation can take the form of, for example, qualification assessments, tests or examinations. Several validation approaches exist that can also be applied to agent learners:

1. Benchmarking. In benchmarking, the performance of a learner is compared to some reference point. This method can be used when clear metrics can be defined and measured to score a learner on task performance. It is used extensively for RL agents in games, since scoring mechanisms are commonly available and performance can easily be compared against human-based reference points [4]. Benchmarking is also a popular technique to compare different learning algorithms on the same task environments.

2. Test scenarios. This method is borrowed from traditional supervised learning and is based on dividing predefined scenarios into sets of training and test scenarios. The approach is used to prevent overfitting of task performance to the situations experienced during training and to verify that the learner can generalize to situations occurring during operation. Test scenarios can be used after training to judge this ability of the learner; a minimal sketch is given after this list.

3. Human judgement. This approach assesses a learner’s behavior based on subjective human judgement, for instance from SMEs. As a human-in-the-loop approach it is often not favored, as it is a resource-intensive process. However, quantitative metrics cannot always be defined to fully cover all facets of a learner’s performance. For instance, consider measuring aspects such as realism or human-likeness when agents are required to simulate the role of human players. An example of a validation procedure for agents acting as human role players in air combat training is given in [27].
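A minimal sketch of the test-scenario approach: predefined scenarios are split into a training set and a held-out test set, and after training the learner is scored only on scenarios it has never experienced. The scenario identifiers and the `run_scenario` helper are hypothetical placeholders.

```python
import random

scenarios = [f"scenario_{i:02d}" for i in range(40)]   # predefined scenario pool
random.seed(0)
random.shuffle(scenarios)
train_scenarios, test_scenarios = scenarios[:30], scenarios[30:]


def run_scenario(policy, scenario_id):
    """Hypothetical helper: run one scenario and return a task-performance score."""
    raise NotImplementedError


def validate(policy):
    # Judge generalization on scenarios never seen during training.
    scores = [run_scenario(policy, s) for s in test_scenarios]
    return sum(scores) / len(scores)
```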

Transfer of Training Environment.

This is the ability of the learner to apply the task learned in the training environment to the operational environment. For the scope of this paper we assume that for RL, final training in the operational environment is always required. Therefore this type of transfer is also represented in Fig. 1 by the learner transfer loop towards a follow-up training with the same task but different environment.

Continuous Learning.

Human learners inherently continue learning after training and can become more proficient in task performance through experience on the job. Certain professions require learners to undergo recurrent or refresher training, as skill decay can occur over time. For agent learners, continuous learning is a design choice.

An advantage of online learning is that the learner can continuously adapt to evolving environments, beyond the scope that may have been taken into account during training. However, compared to a training environment, agents in operational environments may not be permitted to make critical mistakes and may have to adhere to safety constraints. Techniques such as safe reinforcement learning address this issue [28].

A possible downside is that validated performance can no longer be guaranteed. For instance, continuous optimization on certain aspects of a task could lead to decreased performance on other aspects. This is known as the stability-plasticity problem in learning algorithms, where plasticity is required to integrate new skills, but stability is needed to retain existing skills [29]. Similar to human learners, recurrent training and validation can be used to ensure the retention of skills.

4 Adaptive Instructional System for Learner Agents

In this section we zoom in on the training system that is central to Fig. 1 and present it as an adaptive instructional system (AIS). For professional training of human learners, AISs are increasingly in demand, due to the availability of high-fidelity simulation-based training environments and the trend towards more personalized training to suit the needs of individual learners. In this section we explore the application of the AIS concept to agent (cf. computational) learners, as opposed to human learners. First, we present the definitions for the components of an AIS that we use in this section.

4.1 Defining the Components of an AIS

Adaptive instructional systems are often characterized by four functional components [30]. Although the exact division of functions and responsibilities of these components and their interactions varies from system to system, we use the following descriptions:

1. Domain Model: defines the task objectives and the domain (expert) knowledge on the task to be learned, such as the skills, knowledge or problem-solving strategies. Besides information on what has to be learned, it provides a performance standard: metrics or indicators that can be used to judge the learner’s performance and progress with respect to the task.

2. Learner Model: based on the task objectives and performance standard, the learner model measures and evaluates the learner’s progress towards this standard. Additionally it maintains the learner’s evolving states (e.g. cognitive, affective, motivational or physiological states) that can be used to adapt the training and optimize the learning process.

3. Instructional Model: based on the learner’s current evaluation and mental states, the instructional model implements the system’s instructional strategies. It plans, coordinates and applies teaching activities through direct interventions (such as providing feedback to the learner) or indirect interventions (such as adapting the learning content or environment).

4. Interface Model: provides a user interface for the instructional system to interact with the learner and learning environment.

4.2 Reinforcement Learning by an AIS

When an AIS is considered for training an agent learner, the functions of an AIS resemble the typical functions of a learning algorithm. Below, we illustrate this using a basic description of RL.

RL algorithms are based on the concept that an agent learns how to make sequences of decisions (captured in a policy) from experiences obtained through interaction with an environment. The agent continuously takes actions based on its current policy and observes a new state of the environment. In parallel, the agent receives feedback signals (called rewards) from an external critic, indicating positive or negative trends towards some goal. The learning process revolves around correlating these signals with the agent’s own actions and the resulting changes in the environment in order to update its policy, with the goal of maximizing future cumulative rewards.
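This interaction can be made concrete with a minimal agent-environment loop, written here against the classic OpenAI Gym-style interface (pre-0.26 step signature). The environment and the placeholder agent are illustrative assumptions; a real RL algorithm would implement the `act` and `update` methods.

```python
import gym


class Agent:
    """Placeholder learner: acts randomly and ignores feedback (stand-in for an RL algorithm)."""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, obs):
        return self.action_space.sample()

    def update(self, obs, action, reward, next_obs, done):
        pass  # a real RL algorithm would update its policy here


env = gym.make("CartPole-v1")          # stand-in for any task environment
agent = Agent(env.action_space)

for episode in range(10):
    obs = env.reset()
    done = False
    while not done:
        action = agent.act(obs)                              # act on the current policy
        next_obs, reward, done, info = env.step(action)      # observe the new state
        agent.update(obs, action, reward, next_obs, done)    # learn from the feedback signal
        obs = next_obs
```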

Fig. 2. AIS for an agent learner

Different aspects of an RL algorithm can be mapped to the functions of an AIS, as defined in Sect. 4.1. Figure 2 illustrates this mapping. The Domain Model contains the training goal of the RL algorithm, such as winning a game, defeating an enemy or accomplishing a task. In an RL algorithm, this goal is often not explicitly stated but rather kept in the head of the designer. The Learner Model judges the current policy of the agent by measuring its behavior, either directly or indirectly through environment observations (the latter is not shown in the illustration). The Instructional Model determines what feedback to give to the agent and when; this is commonly implemented by a reward function. Finally, the Interface Model defines the actions and observations for the agent’s interface with the environment. In addition to the AIS functions, Fig. 2 shows two concepts: (1) the ability to bootstrap the agent with priors, such as starting with pre-trained models or encoded domain knowledge (cf. Sect. 3.1), and (2) the ability of indirect intervention through environment adaptation (cf. Sect. 3.3).

In the mapping presented in Fig. 2, a basic RL algorithm is essentially dissected into its individual functions, each one addressing a key question for a designer of a training system:

  • How to define the task and its performance metrics?

  • What behaviors to measure from the learner to evaluate its performance?

  • What instructional strategies to apply?

  • What sensors and actuators does the learner require to perform its task?

Although RL algorithms come in many different forms and flavors, these core questions apply to all of them when they are applied to a task domain.
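The dissection above can also be expressed in code. The sketch below is a hypothetical arrangement of the four AIS components around an otherwise unspecified RL agent (reusing the `act`/`update` interface assumed in the earlier loop sketch); all class and method names are assumptions chosen to mirror Sect. 4.1, not an existing API.

```python
class DomainModel:
    """Task objectives and performance standard (Sect. 4.1, component 1)."""
    def performance(self, trajectory):
        return sum(reward for (_, _, reward, _) in trajectory)   # placeholder metric


class LearnerModel:
    """Measures and evaluates the learner against the standard (component 2)."""
    def __init__(self, domain):
        self.domain, self.history = domain, []

    def evaluate(self, trajectory):
        score = self.domain.performance(trajectory)
        self.history.append(score)
        return score


class InstructionalModel:
    """Decides on feedback and on environment adaptations (component 3)."""
    def reward(self, obs, action, next_obs):
        return 0.0                          # placeholder reward function (scaffolding)

    def adapt(self, env, evaluation):
        return env                          # e.g. scale scenario difficulty


class InterfaceModel:
    """Defines the agent's observations and actions (component 4)."""
    def observe(self, raw_state):
        return raw_state

    def apply(self, env, action):
        return env.step(action)             # classic Gym-style 4-tuple assumed


def training_session(agent, env, learner, instruction, interface, episodes=100):
    for _ in range(episodes):
        obs, done, trajectory = interface.observe(env.reset()), False, []
        while not done:
            action = agent.act(obs)
            raw_obs, _, done, _ = interface.apply(env, action)
            next_obs = interface.observe(raw_obs)
            reward = instruction.reward(obs, action, next_obs)   # feedback from the tutor
            agent.update(obs, action, reward, next_obs, done)
            trajectory.append((obs, action, reward, next_obs))
            obs = next_obs
        env = instruction.adapt(env, learner.evaluate(trajectory))
```

Note that in this arrangement the reward originates from the Instructional Model rather than from the environment, reflecting the AIS view that feedback is an instructional intervention.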

4.3 Advantages of an AIS

The main advantage of mapping the training process of RL agents onto the framework of an AIS becomes apparent when agents are considered for operation within a broader application scope, rather than a mere demonstration of learning abilities in a ‘toy-problem’ environment. In such a broader scope, designers of RL agents may be bound to external system constraints such as available simulation environments, or be dependent on SMEs to provide task objectives, task analyses and performance standards.

In this respect, the “AIS view” of an RL agent provides a separation of learning, which focuses on the agent’s cognitive abilities and is treated as a black box, from training, which focuses on applying these abilities to teach concrete tasks relevant to some target application domain. This enhances the abstraction from the specific algorithmic learning technology that is applied and makes it easier to consider and integrate alternative or improved algorithms as they become available, with limited impact on the instructional components. In Sect. 5 we illustrate the embedding of learner agents in a broader application scope.

4.4 Dependencies on the Learner’s Capabilities

An AIS forces a view that separates internal learning processes within the agent from external instruction. These two parallel processes could potentially lead to conflicts. This is best described by the exploration-exploitation dilemma in RL [3]. Exploration is the intentional deviation of the agent from its currently known policy by performing (semi-)random actions, with the goal of exploring its task domain and potentially finding new optimal solutions to the task that would not have been found had it solely pursued its current policy in a greedy fashion. From an external instruction point of view, this behavior may seem erroneous, and an AIS could have the tendency to correct this intentional behaviour.
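As a concrete example of such an internal process, the sketch below shows epsilon-greedy action selection, one of the simplest exploration strategies: with probability epsilon the agent deliberately deviates from its current policy. Viewed from the outside, these deviations can look like errors even though they are part of the learning mechanism. The action set and the tabular value estimates are illustrative assumptions.

```python
import random

ACTIONS = ["turn_left", "turn_right", "climb", "descend"]   # illustrative action set
q_values = {}   # state -> {action: estimated value}, filled in during learning


def epsilon_greedy(state, epsilon=0.1):
    """With probability epsilon, deliberately deviate from the current policy."""
    if random.random() < epsilon or state not in q_values:
        return random.choice(ACTIONS)    # exploration: may look 'erroneous' externally
    return max(q_values[state], key=q_values[state].get)   # exploitation: greedy action
```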

Exploration strategies belong to the category of RL challenges that deal with enabling agents to learn faster by understanding the scope of the task domain and the role they play in it. Other well-researched internal learning strategies are intrinsic motivation (developmental learning) [31] and meta-learning (learning to learn) [32]. In human learners, analogous internal learning processes are present. However, these tend to be more stable and predictable during adulthood. When initiating training, one can assume that humans are equipped with ‘startup software’ acquired during childhood, and that they have the ability to generalize and apply previously learned problem-solving strategies to new situations. The lack of this ability in agents is currently one of the biggest challenges in machine learning [6]. The consequence is that RL agents often need to be trained ‘from scratch’. This should be taken into account when designing an AIS. For instance, for humans, one would not use the same AIS to train professional skills in a child as in an adult.

5 Towards a Domain Implementation

In this section we show an example of how agent learners can be made fit-for-purpose for a concrete application domain, using an AIS and combining the insights from Sects. 3 and 4. The example domain is that of a military (training) simulation system in the air combat domain, where agents adopt the role of simulated fighter pilots. The use of agents in this domain has been well researched, both for RL algorithms [33,34,35] and for simulation systems for human training [36, 37].

A high-level design for the training system implementation is shown in Fig. 4. It provides the infrastructure for (1) an air combat training simulation environment with aircraft models, (2) pilot agents with a learning algorithm and acting and sensing capabilities (for the aircraft’s navigation, sensor and weapon systems in the environment), and (3) an instructional component to form an AIS (see the caption of Fig. 4).

The role of the AIS is to teach individual tasks that can be stored in a Task Library and reused by scenario developers to ‘compose’ pilot behaviors for a desired scenario. A task could represent defending a section of air space; a tactical engagement with an opponent; or individual tactical maneuvers such as formation flying or performing an evasive maneuver. Task compositionality offers the freedom of choosing different suitable learning algorithms for different task types and has been proposed in this application domain before [38, 39]. In the remainder, the AIS process of teaching tasks is described for two concrete examples, each handled by a different learning algorithm.

5.1 Task Learning

We have implemented two learning algorithms for different tasks. The first one is a neural network-based RL algorithm that is used for low-level aircraft maneuvering tasks. The second is a rule-based RL algorithm that is used for tactical-level tasks. The latter allows for the encoding of task knowledge from SMEs, for instance to constrain behaviors to specific tactics and procedures.

Figure 3 shows a scenario that combines the use of these task types. The scenario is a 2v2 encounter between blue and red forces in three phases: the ingress, engagement and egress phase. A formation flying task has been defined for the ingress and egress phases; and a complete 1v1 engagement task covers the engagement phase. The training of these two tasks with their selected learning algorithms is described below.

Fig. 3. An air-to-air encounter in a 2v2 scenario. During ingress (left), the blue forces fly in formation towards red; then, individual 1v1 engagements take place between blue and red (middle); during egress (right), the blue forces regroup and exit in formation after successful engagements. (Color figure online)

Fig. 4. Training system for an agent learner. From the AIS point of view, the learner is the Agent (middle left); the Domain Model is represented by the Task Domain (top); the Learner Model and Instructional Model are embedded in Adaptive Instruction (middle right); and the Interface Model is represented by the control and observation interfaces of the agent and tutor.

Task: Formation Flying

In this task the agent is taught a maneuver to keep its position and attitude relative to another aircraft (i.e. keeping formation). The learning algorithm used is a neural network-based RL algorithm (DQN). To teach this maneuver, the AIS generates suitable training scenarios. During a scenario, the tutor continuously evaluates the agent’s performance and provides feedback to the learning algorithm. The agent learns to control the navigation systems of the aircraft, including heading and speed.

To speed up learning, smart scenario adaptation is used to optimize the learning experiences and thereby the learning speed of the agent. This involves (1) preparing new scenarios with (semi-)randomized initial positions and bearings between the learning agent and the reference agent, and (2) terminating the scenario and starting a new one when either a set amount of time has passed or the distance between the two agents becomes too large.

The task can be parameterized with distance and bearing configurations. These parameters become additional inputs to the learning algorithm and are varied between training scenarios. The algorithm thus learns many task variations within the same training. The resulting task model can then be used by scenario developers for all sorts of formation configurations in 3D space with any number of aircraft. The task is used for both the ingress (approach) and egress (regrouping) phases of the scenario in Fig. 3.
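A sketch of how such scenario adaptation and task parameterization might look is given below. The value ranges, termination thresholds and helper names are illustrative assumptions, not the actual implementation of the described system.

```python
import random

MAX_SEPARATION_M = 4000.0   # terminate when the formation has clearly broken up (assumed value)
MAX_EPISODE_S = 120.0       # terminate after a fixed amount of scenario time (assumed value)


def new_formation_scenario():
    """Prepare a scenario with (semi-)randomized geometry and task parameters."""
    return {
        "initial_bearing_deg": random.uniform(0.0, 360.0),
        "initial_distance_m": random.uniform(500.0, 3000.0),
        # Task parameters that become additional inputs to the learning algorithm:
        "target_distance_m": random.choice([300.0, 600.0, 900.0]),
        "target_bearing_deg": random.choice([30.0, 45.0, 60.0]),
    }


def should_terminate(elapsed_s, separation_m):
    """End the episode early so the agent gathers informative experiences faster."""
    return elapsed_s >= MAX_EPISODE_S or separation_m >= MAX_SEPARATION_M
```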

Task: Engagement

In this task the agent is taught to perform a 1v1 engagement with an enemy aircraft. The learning algorithm that is used is dynamic scripting, which has been demonstrated in the air combat domain before [34, 40]. In contrast to the previous task, this algorithm uses priors defined by an SME: a rule database from which the learning algorithm can select and try out a subset of rules, with the goal of learning optimal combinations.

In contrast to the previous task, the AIS only provides feedback to the learning algorithm once, at the end of a scenario (i.e. an engagement). Based on performance metrics such as win or loss, missile usage or time, an evaluation is made to determine the success of the currently used rules. Based on this evaluation, the agent selects a new subset of rules for a new scenario. Over time, successful (combinations of) rules are reinforced and unsuccessful ones fade. Similar smart scenario adaptation is used to speed up learning: e.g. a new scenario is started when one of the agents is killed, no more missiles are left, or some maximum scenario time has been reached.
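The sketch below illustrates the core of this learning loop: rules are drawn from an SME-authored rule base with probabilities proportional to their weights, and the weights of the rules used in a scenario are adjusted according to the end-of-scenario evaluation. The rule names, script size and weight-update scheme are simplified, illustrative assumptions rather than the exact mechanism of [11, 34, 40].

```python
import random

# SME-authored rule base: rule name -> weight (all rules start equally likely).
rule_base = {name: 100.0 for name in [
    "fire_at_max_range", "crank_after_launch", "pump_when_threatened",
    "notch_incoming_missile", "re_engage_after_pump", "abort_on_low_fuel",
]}
SCRIPT_SIZE = 3


def select_script():
    """Select a subset of rules for the next scenario, weighted by their current weights."""
    rules = list(rule_base)
    weights = [rule_base[r] for r in rules]
    script = set()
    while len(script) < SCRIPT_SIZE:
        script.add(random.choices(rules, weights=weights)[0])
    return list(script)


def update_weights(script, evaluation, learning_rate=20.0, floor=10.0):
    """Reinforce the rules used in a successful scenario; weaken them otherwise.

    `evaluation` is an end-of-scenario score in [-1, 1], derived from metrics
    such as win/loss, missile usage or time (assumed scale)."""
    for rule in script:
        rule_base[rule] = max(floor, rule_base[rule] + learning_rate * evaluation)
```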

5.2 Data-Driven Task Learning

One aspect of Fig. 4 that has not yet been discussed (and has not been implemented) is the Task Samples shown in the Task Domain. These represent datasets that could be obtained from simulator recordings (e.g. human demonstrations) or external sources (e.g. live recordings) that relate to the task to be learned. Such datasets could be used for offline training to bootstrap the learning process of the agent. For instance, a pre-trained model for the task could be learned by means of offline imitation or observational learning (see social learning in Sect. 3.3), and then be used as a prior for the online training. An approach for such data-driven learning in the domain of military simulation is described in [41] and termed data-driven behavior modelling (DDBM).

5.3 Concluding

The training system that was described touches upon all design aspects described in Sects. 3 and 4 that should be considered for training an agent. In summary, these concern the need to perform a needs analysis to define the task and choose a suitable learning mechanism; the option or requirement to implement priors (pre-trained models or encoded knowledge); the implementation of the actual adaptive instruction through AIS components; and the possible validation strategies to apply.

Although different learning algorithms require different strategies for the AIS, many processes and components can be shared. Establishing reusable AIS components enhances abstraction from the learning algorithm and limits the impact on the system as a whole when new algorithms are introduced.

6 Discussion and Future Directions

In this paper we argued for the use of AISs to operationalize RL agents and make them fit-for-purpose for target application domains. The key role of an AIS is to introduce a training system that bridges the gap between having an agent with a general-purpose learning algorithm and training that agent to operate in some task domain.

As AISs have their foundation in human instruction, we analyzed the process of training by drawing comparisons between human and agent learning and training in Sect. 3. We found that many concepts from RL research can be mapped to concepts from human instruction and vice versa. Of course, fundamental differences between human and agent learners remain: e.g. machines do not get tired or bored, and machine learning algorithms have yet to parallel the ability of human learners to generalize problem-solving and learn from few experiences. As a consequence, different instructional approaches are deemed necessary, tailored to the different characteristics of the learners. The same also holds between two RL algorithms with different learning mechanisms. Still, the overall goal of instruction is the same, namely to optimize training for learners in a task domain. Having a shared high-level framework for training and adaptive instruction, as presented in Sects. 3 and 4, helps to place the various requirements, processes and strategies for teaching RL agents into perspective.

From an implementation perspective, the AIS concept seems to be a good fit for teaching agents. It abstracts away from the algorithmic learning technology, allowing system designers to focus on the design of the instructional components around it. This is demonstrated in Sect. 5 by an implemented training system where the task domain can be shared by both agents and humans.

In this paper, a first step is taken towards applying AIS theory to teaching agents. This line of thought can be continued in future work, for instance by exploring the potential alignment with standardization and interoperability efforts for AISs [42, 43]; such efforts can also benefit the development of training systems for agent learners. As learning algorithms continue to become more advanced and are expected to acquire capabilities that parallel human learning, it becomes more likely that instructional strategies for humans and agents will grow more comparable. This will encourage further cross-field synergy between RL and instructional science that can be explored using AISs.