
1 Introduction

Interactive e-Learning Environments such as Intelligent Tutoring Systems (ITSs) and educational games have become increasingly prevalent in educational settings. In domains like math and science, solving a problem often requires producing one or more steps, each of which results from applying a domain principle or rule. For example, \(2x +5 =9\) can be solved for x in two steps: (1) subtract 5 from both sides of the equation; and (2) divide both sides by 2. Tutoring in such domains is thus often structured as a two-loop procedure [35]: the outer loop makes problem-level decisions, such as problem selection, while the inner loop controls step-level decisions, such as whether to give hints or feedback. As a result, there are decisions to make and interventions to offer at different levels of granularity, such as hints, worked examples, immediate feedback, or suggested subgoals, and some are more important or impactful than others. Human decision-makers treat these distinct levels of granularity differently and are capable of selecting between them [7, 12].

Data-driven approaches, and especially reinforcement learning (RL), have been shown to improve the effectiveness of ITSs [4, 5, 9, 10, 19, 28, 29, 39]. However, most prior applications of RL for pedagogical policy induction treat all system decisions equally or independently and do not account for the long-term impact of higher-level actions or the interaction of decisions made at different levels. In this paper, we propose and apply an offline, off-policy Gaussian Process-based (GP-based) Hierarchical Reinforcement Learning (HRL) framework to induce a hierarchical pedagogical policy at two levels of granularity: problem and step. More specifically, our HRL policy first makes a problem-level decision and then makes step-level decisions conditioned on that problem-level decision. In this study, for example, our HRL policy first decides whether the next problem should be a worked example (WE), problem solving (PS), or a faded worked example (FWE). In WEs, students observe how the tutor solves a problem; in PSs, students solve the problem themselves; in FWEs, the students and the tutor co-construct the solution. Based on the problem-level decision, the HRL policy then makes step-level decisions on whether to elicit the next solution step from the student or to show it to the student directly. We refer to such decisions as elicit/tell decisions. If WE is selected, an all-tell step policy is carried out; if PS is selected, an all-elicit policy is executed; finally, if FWE is selected, the tutor decides whether to elicit or to tell each step based on the corresponding induced step-level policy (see the illustrative sketch below). WE and PS can thus be seen as the two extreme ends of FWE. Therefore, one non-hierarchical way to make decisions would be to focus on step-level decisions alone.
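To make this decision structure concrete, the sketch below (our own illustrative code, not the tutor's implementation) shows how a single problem-level decision expands into step-level elicit/tell actions; `problem_policy` and `step_policy` are placeholders for the induced HRL policies.

```python
# Illustrative sketch of the two-level decision structure (hypothetical names).
def choose_step_actions(problem_state, step_states, problem_policy, step_policy):
    """Return one elicit/tell decision per solution step of the next problem."""
    problem_action = problem_policy(problem_state)    # one of "WE", "PS", "FWE"
    if problem_action == "WE":                        # worked example: tutor shows every step
        return ["tell" for _ in step_states]
    if problem_action == "PS":                        # problem solving: student does every step
        return ["elicit" for _ in step_states]
    # faded worked example: the induced step-level policy decides step by step
    return [step_policy(s) for s in step_states]
```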

In a classroom study, we compared the HRL-induced hierarchical policy (HRL) with two step-level policies: a Deep Q-Network induced policy (DQN) and a random yet reasonable (Random) policy; the latter is reasonable because both elicit and tell are always considered acceptable educational interventions in our learning context. A total of 180 students were randomly assigned to the three conditions, and our results showed that the HRL policy was significantly more effective than the DQN and Random policies, while no significant difference was found between the latter two. For time on task, no significant difference was found between the HRL and Random conditions, but HRL spent significantly more time than DQN. Finally, the induced HRL policy was more likely to select PS and FWE than WE, which confirmed our hypothesis that HRL would strike the right balance in pedagogical decision making, targeting WEs and tells to just those problems and steps that need them.

2 Background and Related Work

2.1 Previous Research on Applying RL to ITSs

Generally speaking, RL approaches can be classified as online, where the agent learns a policy in real time by interacting with the environment, or offline, where the agent learns from pre-collected training data. RL approaches can also be divided into on-policy vs. off-policy, based on the relationship between their behavior and estimation policies [32]. In on-policy RL, the behavior policy used to control how the agent explores the environment (online), or collects training data (offline), is the same as the estimation policy being learned. In off-policy methods, these two policies may be unrelated. Both online and offline RL approaches have been used for pedagogical policy induction in recent years; among them, prior research mainly took an off-policy RL approach [3, 4, 9, 10, 13, 19, 28, 36, 39]. Next, we will describe prior RL work from the online vs. offline perspective.

Online RL research to induce pedagogical policies has often relied on simulations or simulated students. As a consequence, the success of these approaches is heavily dependent on the accuracy of the simulations. Beck et al. [3] applied temporal difference learning, with off-policy \(\epsilon \)-greedy exploration, to induce pedagogical policies that would minimize the students’ time on task. Iglesias et al. applied Q-learning, another common online, off-policy approach, to induce policies for efficient learning [9, 10]. More recently, Rafferty et al. applied POMDPs with off-policy tree search to induce policies for faster learning [19]. Wang et al. applied an online, off-policy Deep RL approach to induce a policy for adaptive narrative generation in an educational game [36]. All of the models described above were evaluated via simulations or classroom studies, yielding improved student learning and/or behaviors as compared to baseline policies.

Offline RL approaches, on the other hand, “take advantage of previous collected samples, and generally provide robust convergence guarantees” [25]. The success of offline RL is thus often heavily dependent on the quality of the training data. One common convention is to collect an exploratory corpus by training students on an ITS that makes random yet reasonable decisions and then apply RL to induce pedagogical policies from that corpus. Shen et al. applied value iteration and least-squares policy iteration on a pre-collected training corpus to induce pedagogical policies aimed at improving students’ learning performance [27, 28]. Chi et al. applied policy iteration to induce a pedagogical policy aimed at improving students’ learning gains [4]. Mandel et al. [13] applied an offline POMDP approach to induce a policy aimed at improving student performance in an educational game. In classroom studies, most of the models above were found to improve student learning relative to a baseline policy.

Despite these successes, the need for accurate simulations (online) or large training corpora (offline) has limited the widespread use of RL for policy induction. Additionally, prior research on both online and offline RL has not taken the granularity of decisions into account when applying RL techniques to induce pedagogical policies. In the remainder of the paper, we refer to these approaches as flat RL to differentiate them from our new HRL approach.

It has been widely shown that HRL can be more effective and data-efficient than flat RL approaches [6, 11, 18, 22, 37]. HRL generally breaks a large decision-making problem down into a hierarchy of smaller sub-problems and induces a policy for each of them. Since the sub-problems are small, they usually require less data to find optimal policies. For example, Cuayhuitl et al. induced navigation policies [6] at three levels: buildings, floors, and corridors, showing that HRL converged to an optimal policy in far fewer iterations. Peng et al. showed success using temporal HRL to induce locomotion control policies for path following and soccer dribbling, tasks that flat policies could not complete [18]. Although promising, the use of hierarchy requires additional information, such as the transitions and rewards at different levels of granularity, to induce a policy, and this information may be hard to obtain from pre-collected data. Therefore, most existing HRL applications have been online; here, by contrast, we propose and apply an offline, off-policy HRL approach. To the best of our knowledge, this is the first attempt to apply HRL to induce pedagogical policies.

2.2 WE, PS and FWE

Prior research has investigated the effectiveness of WE, PS, FWE, and their various combinations [14,15,16,17, 21, 23, 26, 31, 33]. When focusing on PS and WE, Mclaren et al. found no significant difference in learning performance between studying WE-PS pairs and doing PS-only, but the former spent significantly less time than the PS-only condition [16]. In a subsequent study, Mclaren et al. compared three conditions: WE-only, PS-only, and WE-PS pairs [15]. Similarly, no significant differences were found among them in terms of learning gains, but the WE-only condition spent significantly less time than the other two, and no significant time-on-task difference was found between PS-only and WE-PS pairs.

Several studies have compared different combinations of WE, PS, and FWE. Renkl et al. compared WE-FWE-PS with WE-PS pairs: the former significantly outperformed the latter on learning performance, while no significant difference was found between them on time on task [21]. Similarly, Najar et al. compared adaptive WE/FWE/PS with WE-PS pairs [17]. They found that the former significantly outperformed WE-PS pairs in terms of learning outcomes and also spent significantly less time on task. For adaptive WE/FWE/PS, they used expert rules to make decisions based on student learning states. Finally, Salden et al. compared three conditions: WE-FWE-PS, FWE, and PS-only [23]. Their results showed that FWE outperformed WE-FWE-PS, which in turn outperformed PS-only, and no significant time-on-task difference was found among the three conditions. Note that in their study, the order of WE, FWE, and PS was fixed in WE-FWE-PS, while in the FWE condition the tutor used an adaptive pedagogical policy based on expert rules combined with data-driven student models. In short, previous studies have shown that alternating among WE, PS, and FWE can be more effective than alternating between WE and PS alone; however, it is not clear whether the former can be more effective than using FWEs only. Moreover, prior research either used a fixed policy (WE-FWE-PS) or hand-coded expert rules combined with data-driven student models to make decisions. In this work, we applied an offline, off-policy HRL framework to derive a hierarchical pedagogical policy directly from empirical data. Its effectiveness is directly compared against another data-driven FWE policy induced by applying one of the state-of-the-art flat RL methods: Deep Q-Network.

3 Policy Induction

In this work, both our proposed HRL framework and DQN are offline and off-policy in that they induce policies from a historical dataset \(\mathcal {D}\) collected by training students on a version of the ITS that makes random yet reasonable decisions. RL focuses on inducing effective decision-making policies for an agent with the goal of maximizing the agent’s cumulative rewards. In many domains, RL is applied with immediate rewards. In an automatic call center system, for example, the agent can receive an immediate reward for every question it asks because the impact of each question can be assessed instantaneously [38]. Immediate rewards are generally more effective than delayed rewards for RL-based policy induction because it is easier to assign appropriate credit or blame when the feedback is tied to a single decision; the longer rewards or punishments are delayed, the harder it becomes to assign credit or blame properly. The availability of immediate rewards is especially important for HRL approaches. On the other hand, the most appropriate reward to use in ITSs is student learning gain, which is typically unavailable until the entire training process is complete. This is due to the complex nature of the learning process, which makes it difficult to assess students’ learning moment by moment; more importantly, many instructional interventions that boost short-term performance may not be effective in the long term. Therefore, we first proposed and applied a Gaussian Process-based (GP-based) approach to infer “immediate rewards” from the delayed rewards and then applied HRL and DQN to induce the corresponding hierarchical and step-level policies based on the inferred immediate rewards. In the following, we briefly describe: (1) our proposed GP-based approach to infer immediate rewards, (2) our offline, off-policy GP-based HRL framework, and (3) DQN. We present only a few critical details of the process; many have been omitted to save space.

3.1 GP-Based Approach for Immediate Reward Inference

Our historical dataset \(\mathcal {D}\) consists of student-ITS interaction trajectories of different lengths. Each trajectory d can be viewed as: \(s_{1} \xrightarrow {a_{1}, r_{1}} s_{2} \xrightarrow {a_{2}, r_{2}} \cdots s_{n} \xrightarrow {a_{n}, r_{n}}\). Here \(s_{i} \xrightarrow {a_{i}, r_{i}} s_{i+1}\) indicates that at the \(i\)-th turn in d, the learning environment was in state \(s_{i}\), the agent executed action \(a_{i}\) and received reward \(r_{i}\), and the learning environment then transitioned to state \(s_{i+1}\). Because our primary interest is improving students’ final learning, we used Normalized Learning Gain (NLG) as the reward because it measures students’ gain irrespective of their incoming competence: \(NLG = \frac{posttest-pretest}{\sqrt{1-pretest}}\), where pretest and posttest refer to the students’ test scores before and after the ITS training respectively and 1 is the maximum score. Given that a student’s NLG is not available until the entire training is completed, only terminal states have non-zero rewards. Thus, for a trajectory d, \(r_{1}, \cdots, r_{n-1}\) are all equal to 0, and only the final reward \(r_{n}\) is equal to the student’s \(NLG \times 100\), which lies in the range (\(-\infty \), 100].
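As a concrete illustration of this reward scheme, the snippet below (our own code, assuming test scores normalized to [0, 1]) builds the reward sequence for one trajectory; students with a perfect pre-test, for whom the NLG is undefined, were excluded from the analysis (Sect. 4).

```python
import math

def nlg(pretest, posttest):
    """Normalized Learning Gain, with test scores scaled to [0, 1]."""
    return (posttest - pretest) / math.sqrt(1.0 - pretest)

def trajectory_rewards(num_turns, pretest, posttest):
    """Delayed-reward scheme: every reward is 0 except the terminal one,
    which equals the student's NLG x 100 (range (-inf, 100])."""
    rewards = [0.0] * num_turns
    rewards[-1] = 100.0 * nlg(pretest, posttest)
    return rewards
```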

To infer the immediate rewards from the final delayed reward of each trajectory, we applied Gaussian Processes (GP) to learn a distribution function f over the expected values and standard deviations of all of the immediate rewards. More specifically, a prior probability is assigned to each possible function before observation; higher probabilities are then given to functions for which the sum of the generated immediate rewards is close to the observed delayed reward. In other words, the immediate rewards inside each trajectory are inferred by minimizing the mean square error (MMSE) of additive Gaussian distributions [8]: the immediate rewards are assumed to follow Gaussian distributions and to add up to the delayed reward of each trajectory. Following Gaussian Process Regression [1, 20] and exploiting the mutual information shared in the feature representation, we can thus infer the immediate rewards from the delayed rewards.
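To convey the idea without the full GP machinery, the following simplified sketch replaces the GP with a linear-Gaussian reward model (our own simplification; the actual approach uses GP regression and also yields per-reward uncertainty): each delayed reward is regressed on the sum of its trajectory's step features, and the fitted model is then read off step by step.

```python
import numpy as np

def infer_immediate_rewards(trajectories, delayed_rewards, ridge=1e-3):
    """Simplified linear-Gaussian sketch of additive reward decomposition.

    trajectories    : list of (n_i, d) arrays of per-step state-action features
    delayed_rewards : list of scalars, one observed delayed reward per trajectory
    """
    # Additivity assumption: the delayed reward is the sum of the step rewards,
    # so regress it on the summed step features of each trajectory.
    X = np.stack([traj.sum(axis=0) for traj in trajectories])
    y = np.asarray(delayed_rewards, dtype=float)
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)   # ridge / MAP estimate
    # Read off one inferred immediate reward per step.
    return [traj @ w for traj in trajectories]
```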

3.2 An Offline, Off-policy GP-Based HRL for Policy Induction

Most HRL research builds on an extension of Markov Decision Processes (MDPs) called discrete Semi-Markov Decision Processes (SMDPs), and the central idea behind our HRL approach is to transform the problem of inducing effective pedagogical policies into one of computing an optimal policy for choosing actions in an SMDP. An MDP describes a stochastic control process and formally corresponds to a 4-tuple \(\langle S, A, T, R\rangle \). When inducing pedagogical policies, the states S are vector representations composed of relevant learning-environment features such as the difficulty level of a problem, the percentage of correct entries a student has made so far, and so on. In this study, we use a total of 142 state features to describe the learning environment; the actions A are selected from \(\{WE, FWE, PS\}\) for problem-level decisions and from {elicit, tell} for steps; and the reward function R is calculated from the system’s success measure: students’ NLGs. Once S, A, and R have been defined, the transition probabilities T are estimated from the training corpus \(\mathcal {D}\). Once a complete MDP is constructed, calculating an optimal policy via policy iteration is straightforward.
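For a small, discretized version of such an MDP, the policy-iteration computation can be sketched as follows (our own illustrative code; with 142 continuous state features the actual system relies on the function approximation described next rather than a tabular solution).

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Tabular policy iteration for an MDP <S, A, T, R> estimated from a corpus.

    T : (|S|, |A|, |S|) transition probabilities;  R : (|S|, |A|) expected rewards.
    Returns a deterministic policy (one action index per state) and its values.
    """
    n_s, n_a, _ = T.shape
    policy = np.zeros(n_s, dtype=int)
    while True:
        # Policy evaluation: solve V = R_pi + gamma * T_pi V exactly.
        T_pi = T[np.arange(n_s), policy]          # (|S|, |S|)
        R_pi = R[np.arange(n_s), policy]          # (|S|,)
        V = np.linalg.solve(np.eye(n_s) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * (T @ V)                   # (|S|, |A|)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # converged
            return policy, V
        policy = new_policy
```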

SMDPs extend the MDP framework with a set of complex activities [2] or options [30], each of which can invoke other activities recursively, thus allowing for hierarchical policy functions. Complex activities are distinct from primitive actions in that a complex activity may contain multiple primitive actions. In our application, WE, PS, and FWE are complex activities while elicit and tell are primitive actions. A complex activity consists of three elements: a policy \(\pi \) that maps states to available actions or options, a termination condition, and an initiation set. A solution to the SMDP described above is an optimal policy (\(\pi ^*\)), a mapping from states to complex activities or primitive actions, that maximizes the expected discounted cumulative reward for each state.

A complex activity in an SMDP can take a variable number of lower-level activities (or actions) to execute, across multiple time steps. This makes it necessary to extend the state-transition function to take the activity's duration into account: if an activity a executed in state s takes \(t'\) time steps, the transition function is defined by the joint distribution \(P(s', t'| s, a)\) over the resulting state \(s'\) and the number of time steps \(t'\). The expected reward function is likewise extended to accumulate over the time spent in s under action a. More specifically, the Q-value function \(Q(s, a)\) represents the expected discounted reward the agent will gain if it takes action a in state s and follows the policy to the end; for SMDPs, the Bellman equation can be rewritten as:

$$\begin{aligned} Q(s,a)=R(s,a)+\sum _{s',t'} \gamma ^{t'} P(s',t'|s,a) \max _{a'\in A} Q(s',a') \end{aligned}$$
(1)

Here \(0 \le \gamma \le 1\) is a discount factor; if \(\gamma \) is less than 1, rewards obtained later are discounted. In HRL, learning occurs at multiple levels: global learning generates a policy for the top-level decisions, while local learning generates a policy for each complex activity. This process retains the fundamental assumption of RL: that goals are defined by their association with reward, and thus that the objective is to discover actions that maximize the long-term cumulative reward. Local learning focuses not on learning the best policy for the overall task but the best policy for the corresponding complex activity.
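A sample-based form of this backup differs from flat Q-learning only in the \(\gamma ^{t'}\) factor and in the reward accumulated while the complex activity runs, as the tabular sketch below illustrates (our own code; the actual framework replaces the table with the GP approximation of Q described next).

```python
def smdp_q_backup(Q, s, a, acc_reward, s_next, duration, actions, gamma=0.9, alpha=0.1):
    """One sample-based backup of Eq. (1) on a dictionary-based Q table.

    acc_reward : reward accumulated while complex activity `a`, started in
                 state `s`, ran for `duration` primitive time steps.
    """
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    target = acc_reward + (gamma ** duration) * best_next     # gamma^{t'} discount
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q
```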

In our offline, off-policy HRL framework, both problem- and step-level policies were learned by recursively using Gaussian Processes to estimate the Q-value function in Eq. 1. Using an actor-critic policy iteration framework, we iteratively updated the policy until the Q-value function and the induced policy converged. We assume that the Q-value function follows a prior distribution; by combining this prior with the inferred immediate rewards, Gaussian Process Regression provides the posterior distribution of the Q-value function approximation in a tractable way. In this work, our training corpus contains a total of 1,118 students’ interaction logs collected from a series of seven prior studies that followed the same procedure and used the same learning materials as the study described below. To induce the hierarchical policy, we defined a problem-level semi-MDP for determining whether the next problem should be WE, PS, or FWE, and, for each of the training problems, a step-level semi-MDP for inducing a step-level policy that determines elicit vs. tell whenever FWE is selected for that problem.
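The sketch below gives a minimal, single-level picture of this idea as offline fitted-Q iteration with scikit-learn's GaussianProcessRegressor standing in for the GP critic; the actual framework additionally uses the full posterior over Q, an actor-critic update, and applies the procedure recursively at the problem and step levels. All names and simplifications here are ours.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def gp_fitted_q(transitions, actions, gamma=0.9, n_iter=20):
    """Offline fitted-Q iteration with a GP approximator at one level of the hierarchy.

    transitions : list of (state, action_id, immediate_reward, next_state, duration)
                  tuples from the pre-collected corpus, with the GP-inferred
                  immediate rewards of Sect. 3.1.
    Returns a greedy policy over `actions`.
    """
    X = np.array([np.append(s, a) for s, a, _, _, _ in transitions])
    y = np.array([r for _, _, r, _, _ in transitions])          # initial targets
    gp = GaussianProcessRegressor().fit(X, y)
    for _ in range(n_iter):
        # Bellman targets following Eq. (1): r + gamma^{t'} * max_{a'} Q(s', a')
        y = np.array([
            r + (gamma ** t) * max(gp.predict([np.append(s2, b)])[0] for b in actions)
            for _, _, r, s2, t in transitions
        ])
        gp = GaussianProcessRegressor().fit(X, y)
    return lambda s: max(actions, key=lambda a: gp.predict([np.append(s, a)])[0])
```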

3.3 DQN for Policy Induction

A Double DQN approach [34] with the prioritized experience replay technique [24] was applied to induce the DQN step-level policy. A multi-layer perceptron neural network was used to approximate the Q-function. The inputs to the network were a student's last three step observations and the outputs were the Q-values for each possible step-level action (in our case, elicit and tell). The network consists of two 64-unit hidden layers with rectified linear unit (ReLU) activations; the output layer has no activation function. Following standard practice for this algorithm, an experience replay buffer and a target network were used to stabilize training. The data and immediate rewards used for DQN policy induction were identical to those used for HRL.
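For concreteness, the following is a minimal PyTorch sketch of the two ingredients most specific to this setup, the Q-network shape and the double-DQN target; the prioritized replay buffer, target-network updates, and training loop are standard and omitted. Here `obs_dim` is the length of the concatenated last three step observations, and all names are ours.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Two 64-unit ReLU hidden layers; linear output over {elicit, tell}."""
    def __init__(self, obs_dim, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),          # no activation on the output layer
        )

    def forward(self, obs):
        return self.net(obs)

def double_dqn_target(online_net, target_net, reward, next_obs, gamma=0.9):
    """Double DQN: the online network selects a', the target network evaluates it."""
    with torch.no_grad():
        a_star = online_net(next_obs).argmax(dim=1, keepdim=True)   # action selection
        q_next = target_net(next_obs).gather(1, a_star).squeeze(1)  # action evaluation
        return reward + gamma * q_next
```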

4 Empirical Experiment

Participants. This study was conducted in the undergraduate Discrete Mathematics course of the Department of Computer Science at North Carolina State University in the Fall of 2018. The study was given as one of the regular homework assignments; students had one week to complete it and were graded on their demonstrated effort rather than performance. Students (N = 180) were randomly assigned to the three conditions (60 in each of HRL, DQN, and Random). Due to exam preparation and the length of the experiment, only 140 students completed the study. Three students who scored perfectly on the pre-test were excluded from the subsequent analysis, as were nine students who completed the study in groups. The remaining 128 students were distributed as follows: \( N = 44\) for HRL, \( N = 45\) for DQN, and \( N = 39\) for Random. A \(\chi ^2\) test showed that the participants’ completion rate did not differ by condition: \( \chi ^2 (2) = 1.03, p = 0.598\).

Pyrenees is a web-based ITS that teaches students a general problem-solving strategy and 10 major principles of probability, such as the Complement Theorem and Bayes’ Rule. It provides students with step-by-step instruction, immediate feedback, and on-demand help. The help is provided via a sequence of increasingly specific hints; the last hint in the sequence, the bottom-out hint, tells the student exactly what to do. Except for the decision granularity, the remaining components of the tutor, including the GUI interface, the training problems, and the tutorial support, were identical for all students.

Procedure. All three conditions went through the same four phases: (1) textbook, (2) pre-test, (3) training on the ITS, and (4) post-test. The only difference among them was the policy employed by the ITS. During the textbook phase, all students read a general description of each principle, reviewed some examples, and solved some training problems. The students then took a pre-test containing 14 single- and multiple-principle problems. Students were not given feedback on their answers, nor were they allowed to go back to earlier questions (this was also true for the post-test). During ITS training, all three conditions received the same 12 problems in the same order, and each domain principle was applied at least twice. Finally, all students took the 20-problem post-test; 14 of the problems were isomorphic to the pre-test and the remainder were non-isomorphic multiple-principle problems.

Grading Criteria. The pre- and post-test problems required students to derive an answer by writing and solving one or more equations. We used three scoring rubrics: binary, partial credit, and one-point-per-principle. Under the binary rubric, a solution was worth 1 point if it was completely correct and 0 otherwise. Under the partial-credit rubric, each problem score was the proportion of correct principle applications evident in the solution; a student who correctly applied 4 of 5 possible principles would receive a score of 0.8. The one-point-per-principle rubric in turn gave a point for each correct principle application. All of the tests were graded in a double-blind manner by a single experienced grader. The results presented below are based upon the partial-credit rubric, but the same results hold for the other two. For comparison purposes, all test scores were normalized to the range [0, 100].

5 Results

Despite random assignment, a one-way ANOVA on the pre-test score showed a marginally significant difference among the three conditions: \( F (2,125) = 2.805\), \( p = 0.064\), \( \eta = 0.043\). Subsequent contrast analysis showed that DQN scored significantly higher than HRL: \( t (125) = 2.06\), \( p = 0.042\), \( d = 0.46\), and Random: \( t (125) = 2.01\), \( p = 0.046\), \( d = 0.46\); but there was no significant difference between HRL and Random: \( t (125) = 0.02\), \( p = 0.986\), \( d = 0.00\). These results suggest that while random assignment balanced the incoming competence of the HRL and Random conditions, it did not do so for the DQN condition. Therefore, we mainly focus on learning-performance measures that account for the pre-test differences, that is, the adjusted post-test score and the NLG, especially the latter because it is the reward used for policy induction.

Table 1 shows the mean and standard deviation (SD) of students’ learning performance and total training time results across three conditions. From left to right, it shows the condition with the number of students in parentheses, pre-test (Pre), isomorphic post-test (Iso Post), full post-test (Full Post), adjusted post-test (Adj Post), Normalized Learning Gain (NLG), and the total training time on the ITS in hours (Time).

Table 1. Learning performance and time on task

Isomorphic Post-test. To measure students’ learning improvement, we compared their isomorphic post-test scores with their pre-test scores. A repeated-measures analysis using test type (pre-test vs. isomorphic post-test) as a factor and test score as the dependent measure showed a main effect for test type: \( F (1,127) = 158.63\), \( p < 0.0001\), \( \eta = 0.555\), in that students scored significantly higher on the isomorphic post-test than on the pre-test. More specifically, all three conditions scored significantly higher on the isomorphic post-test than on the pre-test: \( F (1,43) = 110.74\), \( p < 0.0001\), \( \eta = 0.720\) for HRL, \( F (1,44) = 34.73\), \( p < 0.0001\), \( \eta = 0.441\) for DQN, and \( F (1,38) = 38.47\), \( p < 0.0001\), \( \eta = 0.503\) for Random. This shows that the basic practice and problems, domain exposure, and interactivity of our ITS effectively helped students acquire knowledge, even when the decisions were made randomly yet reasonably.

Adjusted Post-test. To comprehensively evaluate students’ final performance, we analyzed the full post-test score, which includes six additional multiple-principle problems. An ANCOVA on the post-test score using the pre-test score as a covariate showed a significant difference among the three conditions: \( F (2,124) = 3.86\), \( p = 0.024\), \( \eta = 0.030\). Subsequent contrast analysis on the adjusted post-test score showed that the HRL condition scored significantly higher than the DQN condition: \( t (125) = 2.53\), \( p = 0.013\), \( d = 0.57\), and the Random condition: \( t (125) = 2.36\), \( p = 0.020\), \( d = 0.52\). No significant difference was found between DQN and Random. These results suggest that the HRL policy is significantly more effective than both the DQN policy and the Random policy.

NLG. Similarly, a one-way ANOVA on the NLG showed a significant difference among the three conditions: \( F (2,125) = 4.39\), \( p = 0.014\), \( \eta = 0.066\). Subsequent contrast analysis showed that the HRL condition scored significantly higher than the DQN condition: \( t (125) = 2.75\), \( p = 0.007\), \( d = 0.66\), and the Random condition: \( t (125) = 2.30\), \( p = 0.023\), \( d = 0.52\). Again, no significant difference was found between DQN and Random. These results again suggest that the HRL policy significantly outperformed the DQN policy and the Random policy.

Time on Task. A one-way ANOVA on time on task showed a significant difference among the three conditions: \( F (2,125) = 4.74\), \( p = 0.010\), \( \eta = 0.071\). More specifically, the HRL condition spent significantly more time than the DQN condition: \( t (125) = 3.07\), \( p = 0.003\), \( d = 0.62\), and marginally significantly more time than the Random condition: \( t (125) = -1.75\), \( p = 0.082\), \( d = 0.39\).

Table 2. Step level tutor decisions

Tutor Decisions. Our preliminary log analysis revealed that for the HRL condition, the average numbers of problem-level decisions students received were 0.95 (1.16) for WE, 5.07 (2.58) for PS, and 3.98 (2.49) for FWE (SDs in parentheses). Thus, the HRL policy was more likely to choose PS and FWE than WE. Table 2 shows the number of step-level decisions students received across the three conditions. The first column shows the condition, followed by the numbers of elicits and tells, and finally the percentage of tells. Our preliminary step-level log analysis showed that the HRL condition received more elicits than tells, while the other two conditions received a relatively balanced amount. A one-way ANOVA on the percentage of tells revealed a significant difference among the three conditions: \( F (2,125) = 71.47\), \( p < 0.0001\), \( \eta = 0.533\). Subsequent contrast analysis showed that the HRL condition received a significantly lower percentage of tells than the DQN condition: \( t (125) = -10.00\), \( p < 0.0001\), \( d = 1.78\), and the Random condition: \( t (125) = -10.60\), \( p < 0.0001\), \( d = 2.42\). In addition, the HRL and DQN conditions had a much higher SD on tell percentage, which suggests that the HRL and DQN policies made more personalized decisions than the Random policy.

6 Conclusion and Discussion

In this study, we proposed and applied an offline, off-policy GP-based HRL framework to induce a hierarchical pedagogical policy. The policy makes decisions first at the problem level and then at the step level. At the problem level, it decides whether the next problem should be WE, PS, or FWE. If FWE is selected, a corresponding step-level policy is activated to decide whether the next step should be elicit or tell. In an empirical classroom study, we compared the HRL policy with a DQN-induced step-level policy and a Random step-level policy. Our results showed that the HRL policy was significantly more effective than the DQN policy and the Random policy, while no significant difference was found between the latter two policies. For time on task, there was no significant difference between the HRL condition and the Random condition, but the former spent significantly more time than the DQN condition. Finally, the HRL policy was more likely to choose PS and FWE than WE.

The results suggest that HRL can be more effective than flat RL for pedagogical policy induction. One possible explanation is that HRL has an explicit problem-level view. At the problem level, HRL treats a problem as an atomic action, and this abstraction has two potential advantages: (1) it aggregates the effects of all steps in a problem and (2) it converts a long step-level sequence into a short problem-level sequence. The aggregation of steps across a problem may give HRL a better estimate of the effect of taking a series of steps, while the problem sequence may give HRL a better view of the long-term effects of each problem. In theory, flat RL could learn the impact of a problem by aggregating step-level information, but there is no guarantee that it would. Our results confirm the intuition that HRL should outperform flat RL on pedagogical policy induction because it can simultaneously learn at two levels of granularity: the problem-level outer loop and the step-level inner loop.