1 Introduction

Current state-of-the-art humanoid robots are equipped with passively compliant elements. In addition to inherent safety and enhanced interaction capabilities, the availability of passive elements, e.g., springs, opens up new possibilities for improving the energy efficiency (Xiaoxiang and Iida 2014). For instance, the springs can be used for temporary energy storage by compressing them, and for energy reuse by releasing the stored energy (Ugurlu et al. 2014; Geyer et al. 2006). The difficult open problem that remains is how to make the best use of this mechanism. This paper tackles the problem of finding the optimal way to exploit the passive compliance in a walking robot for the purpose of energy efficiency.

The conventional state-action-based reinforcement learning approaches suffer severely from the curse of dimensionality. To overcome this problem, policy-based reinforcement learning approaches were developed. Instead of working in huge state/action spaces, they use a smaller policy space, which contains all possible policies representable with a certain choice of policy parametrization. Thus, the dimensionality is drastically reduced, and the convergence speed is increased.

In order to find a good solution, i.e., a policy that produces a reward very close to the optimal/desired one, the policy parametrization has to be powerful enough to represent a sufficiently large policy space so that a good candidate solution is present in it. If the policy parametrization is very simple, with only a few parameters, then the convergence is quick, but often a sub-optimal solution is reached. If the policy parametrization is overly complex, the convergence is slow, and there is a higher possibility that the learning algorithm will converge to some local optimum, possibly much worse than the global optimum. The level of sophistication of the policy parametrization should therefore be just right, in order to provide both fast convergence and a sufficiently good solution.

Deciding what policy parametrization to use, and how simple/complex it should be, is a very difficult task, often performed manually via trial-and-error sessions by the researchers. This additional overhead is usually not even mentioned in the related literature and falls into the category of "empirically tuned" parameters, together with the reward function, decay factor, exploration noise, weights, and so on. Since changing the policy parametrization requires restarting the learning process from scratch, this approach is slow and inefficient, as all the accumulated data needs to be discarded. As a consequence, the search for new solutions often cannot be done directly on real-world robot systems; rather, simulation studies are performed for proof of concept. To remedy these issues, we propose an approach that allows changing the complexity, i.e., the resolution, of the policy representation dynamically while the reinforcement learning is running.

The rest of the paper is organized as follows: Sect. 2 provides an overview of the state of the art in the multiple research areas relevant to the interdisciplinary nature of this paper. In Sect. 3, the evolving policy parametrization approach is introduced, a prototype implementation using cubic splines is proposed, and the approach is evaluated via simulation studies. Section 4 explains details concerning the bipedal walking generation scheme, our bipedal robot, and its passively compliant joints. In Sect. 5, the real-world experiments conducted on our passively compliant bipedal robot are thoroughly described and analyzed. In Sect. 6, the obtained results are discussed and some limitations are disclosed. Finally, the paper is concluded in Sect. 7 by summarizing the end results and outlining future directions.

2 Background

2.1 Related work to policy-based RL algorithms

A tremendous effort has been made by researchers in machine learning and robotics to move RL (Reinforcement Learning) algorithms from discrete to continuous domains, thus extending the possibilities for robotic applications (Peters and Schaal 2006; Theodorou et al. 2010a; Coates et al. 2009; Guenter et al. 2007). Until recently, policy gradient algorithms such as Episodic REINFORCE (Williams 1992) and the Episodic Natural Actor-Critic eNAC (Peters and Schaal 2008a) have been well-established approaches to cope with the high dimensionality. Unfortunately, they also have shortcomings, such as high sensitivity to the learning rate and the exploratory variance. To overcome these drawbacks, the following two recent approaches were proposed.

Theodorou et al. (2010b, 2010a) proposed an RL approach for learning parametrized control policies based on the framework of stochastic optimal control with path integrals. They derived update equations that avoid numerical instabilities, because neither matrix inversions nor gradient learning rates are required. The approach demonstrates significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems, such as control of a quadruped robot.

Abdolmaleki et al. (2016) introduced the contextual relative entropy policy search concept that adapts the robot walking controller for different contexts through the use of radial basis functions. The method enabled the controller to learn a policy which adjusts control parameters for a simulated NAO humanoid as it walked forward with a continuous set of walking speeds.

Kober and Peters (2009) developed an episodic RL algorithm called Policy learning by Weighting Exploration with the Returns (PoWER), which is based on the Expectation-Maximization algorithm. One of its major advantages over policy-gradient-based approaches is that it does not require a learning rate parameter. This is desirable because tuning a learning rate is usually difficult for control problems, but critical for achieving good performance of policy-gradient algorithms. PoWER also demonstrates high performance for tasks that are learned directly on real robots, such as the underactuated pendulum swing-up, the ball-in-a-cup task, and the dynamic pancake flipping task (Kormushev et al. 2010).

2.2 Related work to adaptive-resolution RL

Adaptive resolution in state space has been studied in various RL algorithms (Bernstein and Shimkin 2010). Moore and Atkeson employed a decision-tree partitioning of the state space and applied techniques from game theory and computational geometry to efficiently and adaptively concentrate high resolution on critical areas (Moore and Atkeson 1995). They addressed the pitfalls of discretization during reinforcement learning, concluding that in high-dimensional problems it is essential for the learner not to plan uniformly over the state space. However, in the context of RL, adaptive resolution in the policy parametrization remains largely unexplored so far. This paper makes a step exactly in this direction.

2.3 Related work to robot trajectory representations

In order to plan and optimize a trajectory, it first needs to be encoded in a certain way. For instance, cubic splines can be utilized for this task. Similar approaches have been investigated in the robotics literature and are often called trajectory generation with via-points.

As an example, Miyamoto et al. (2004) used an actor-critic reinforcement learning scheme with a via-point trajectory representation for a simulated cart-pole swing-up task. The actor incrementally generates via-points at a coarse time scale, while a trajectory generator transforms the via-points into primitive actions at the lower level.

Morimoto and Atkeson (2007) proposed a walking gait learning approach in which via-points are detected from the observed walking trajectories, and RL modulates the via-points to optimize the walking pattern. The system is applied to a planar biped robot fixed to a boom that constrains the robot motion within the sagittal plane. Exploration tries to minimize the torques while keeping the robot above the desired height to prevent it from tipping over.

Wada and Sumita (2004) developed a via-points acquisition algorithm based on actor-critic reinforcement learning, where handwriting patterns are reproduced by an iterative and sequential generation of short movements. The approach finds a set of via-points to mimic a reference trajectory by iterative learning, with the help of evaluation values of the generated movement pattern.

Liu et al. (2015) proposed a behavior-based locomotion controller. The approach includes feed-forward and feedback mechanisms which correspond to motor patterns and reflexes. An optimization module supports the controller to minimize energy consumption while ensuring stability for a simulated humanoid robot.

Rosado et al. (2015) used kinematic data collected from human walking via a VICON system to train a set of dynamic movement primitives. These trained motion primitives were then used to control a simulated humanoid robot in task space.

Shafii et al. (2015) utilized central pattern generators to modulate generated bipedal walking trajectories with varying hip height. Covariance matrix adaptation evolution strategy enabled the robot controller to search for feasible hip height patterns and walking parameters in a way to optimize forward velocity.

Koch et al. (2015) presented a bipedal gait generation method through the use of movement primitives that are learned from dynamically consistent and optimal trajectories. Morphable movement primitives were learned using Gaussian processes and component analysis. The method allowed fast real-time movement generation for a simulated HRP-2 robot.

2.4 Related work to energy-efficient motion generation

Passive dynamic walkers are known to be energy-efficient mechanisms since they are able to make use of the swinging limb's momentum while walking forward (McGeer 1990). The downside is that this type of bipedal walking is not able to handle human interaction or disturbance rejection even if the robot is actuated (Wisse et al. 2005). Moreover, there are application differences between these types of walkers and 3D fully actuated bipedal robots. Accordingly, this paper does not approach energy optimization from the viewpoint of exploiting the passive walking principle.

A few approaches exist for reducing the energy consumption on fully actuated 3D bipedal walkers (Amran et al. 2010; Minekata et al. 2008), but not in the context of learning a varying-CoM-height walking, as presented in this paper. Previously, machine learning approaches have been successfully used for learning tasks on bipedal robots, such as dynamic balancing, quadruped gait optimization (Kohl and Stone 2004), and whole-body control during kinesthetic teaching (Kormushev et al. 2011a). One especially promising approach for autonomous robot learning is reinforcement learning, as demonstrated in Pastor et al. (2011), Stulp et al. (2010), Kormushev et al. (2010), Guenter et al. (2007), Rosenstein et al. (2006).

Stulp et al. (2010) presented a Policy Improvement with Path Integrals (\(\hbox {PI}^2\)) RL approach for variable impedance control, where both planned trajectories and gain schedules for each joint are optimized simultaneously. The approach is used to enable the robot to learn how to push and open a door by minimizing the average stiffness gains controlling the individual joints, with the aim to reduce energy consumption and to increase safety.

Kormushev et al. (2010) presented the use of Expectation-Maximization-based RL for a pancake flipping task to refine the trajectory of the frying pan and the coordination gain parameters in Cartesian space by using a mixture of proportional-derivative systems with full stiffness matrices. Rosenstein et al. (2006) presented a simple random search approach to increase the payload capacity of a weightlifting robot by exploiting the robot’s intrinsic dynamics at a synergy level. Via-points are learned by exploration in the first phase of learning. RL and simple random search are then used to refine the joint coordination matrices initially defined as identity gains.

RL has been applied previously in the context of bipedal walking optimization, as in Calandra et al. (2014), Deisenroth et al. (2012). However, the goal for optimization is usually achieving the fastest possible gait without any regard to the energy consumption. In contrast, this paper focuses on the energy efficiency as the main optimization goal while at the same time maintaining the walking pace and speed unchanged.

Certain studies in the biomechanics field indicate different aspects of energy-efficient locomotion. Biological systems, for instance humans, store elastic potential energy in muscles and tendons and release it during daily activities such as walking (Ishikawa et al. 2005). The management of the elastic potential energy that is stored in these biological structures is essential for reducing the energy consumption and for achieving mechanical power peaks. In this connection, vertical CoM movement appears to be a crucial factor in reducing the metabolic cost (Ortega and Farley 2005).

Recent advances in robotics and mechatronics have allowed for the creation of a new generation of passively-compliant bipedal robots, such as COMAN (Ugurlu et al. 2014). Similar to biological systems, the elastic structures in this robot can store and release energy, which can be extremely helpful if properly used. However, it is difficult to pre-engineer an analytical way to utilize the passive compliance for dynamic walking tasks. One possible application is the utilization of the passive compliance via machine learning for the energy-efficient bipedal walking generation task. In this paper, we present an approach that minimizes the walking energy by learning a varying-CoM-height walking which efficiently uses the passive compliance of the robot. In doing so, a combination of machine learning and biomechanics insights is exploited to enhance an existing technology in bipedal locomotion control.

2.5 Novelty

In this paper, we develop a learning-based method to minimize the walking energy required by a passively-compliant bipedal robot. The energy minimization problem is challenging due to the difficulty of accurately modeling the properties of the springs, the dynamics of the whole robot, and various nonlinearities.

The contributions of this paper are twofold: (i) evolving policy parametrization; and (ii) the first experimentally demonstrated walking energy minimization for fully actuated 3D bipeds through the utilization of passive compliance.

First, we introduce a novel reinforcement learning technique which allows the use of adjustable-over-time policy parametrization. The proposed learning mechanism can incrementally evolve the policy parametrization as necessary, starting from a very simple parametrization and gradually increasing its complexity (i.e., resolution), and therefore, its representational power. We call this mechanism evolving policy parametrization and introduce a practical method to implement it using splines.

Second, we exploit the passive compliance built into our bipedal robot in order to minimize the energy needed for walking. Using the proposed reinforcement learning algorithm, it is possible to find the optimal vertical CoM trajectory which minimizes the consumed energy. In this regard, the authors would like to highlight the fact that this paper reports the first experimental results in which physical compliance is successfully utilized in a walking energy minimization task on a fully actuated and compliant 3D bipedal robot.

An early version of this paper containing preliminary experimental results was presented in (Kormushev et al. 2011b). The current paper is significantly expanded and improved to provide an archival report: it presents the evolving policy parametrization technique, elaborates numerous details about the approach, its implementation and application, and adds new experimental results with thorough analyses and an extensive discussion.

3 Evolving policy parametrization

We present an RL approach that allows the complexity, i.e., the resolution, of the policy representation to be changed dynamically while the reinforcement learning process is running, without losing any portion of the collected data and without having to restart the learning process. We propose a mechanism which can incrementally evolve the policy parametrization as necessary, starting from a very simple parametrization and gradually increasing its complexity, and thus, its representational power. The goal is to create an adaptive policy parametrization which can automatically grow to accommodate increasingly more complex policies and get closer to the global optimum. A very desirable effect of this mechanism is that the tendency to converge to a sub-optimal solution is reduced: this tendency is less pronounced in lower-dimensional representations, and gradually increasing the complexity of the parametrization helps to avoid getting trapped in a poor local optimum.

To achieve this goal, the most important property which a policy encoding should provide is backward compatibility. This means that each subsequent, higher-resolution parametrization should remain compatible with the previously collected data, such as past rollouts, their corresponding policies, and rewards. In general, it is possible to consider cases in which simplifying the policy parametrization might be useful, but in this work we assume that we only want to increase the complexity of the policy over time, and never to reduce it.

3.1 Spline policy representation

One of the simplest representations that has the property of backward compatibility is the geometric spline. For example, if we have a cubic spline with K knots (or via-points) and then increase the number of knots, we can still preserve the exact shape of the curve (trajectory) generated by the spline. In fact, placing an additional knot between every two consecutive knots of the original spline results in a spline with \(2K-1\) knots which coincides with the original spline.

Based on this fact, we propose to use the spline knots as the policy parametrization and to exploit the backward compatibility property of splines for evolving the policy parametrization without losing the previously collected data. In order to achieve this goal, we need to define an algorithm to adjust the parametrization from K to L knots (\(L > K\)), which is formulated in Algorithm 1. Without loss of generality, the values of the policy parameters are normalized in the range [0, 1], and appropriately scaled/shifted as necessary later upon use. Figure 1 illustrates the process of using the spline representation for the evolving policy parametrization. Figure 2 shows an example of a reinforcement learning process using evolving policy parametrization to approximate an unknown function.
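To make the re-parametrization step concrete, the sketch below shows one way the adjustment from K to L knots could be implemented, assuming uniformly spaced knots on [0, 1] and natural boundary conditions; it is a minimal illustration with names of our choosing, not a verbatim reproduction of Algorithm 1.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def evolve_policy_parametrization(theta_K, L):
    """Re-encode a K-knot cubic-spline policy with L > K knots.

    theta_K: array of K policy parameters (spline values at the old knots,
             normalized to [0, 1]).
    Returns L parameters whose spline reproduces the old trajectory exactly
    when the new knots contain the old ones (and the same boundary
    conditions are used), and approximately otherwise.
    """
    K = len(theta_K)
    t_old = np.linspace(0.0, 1.0, K)   # old, coarse knot locations
    t_new = np.linspace(0.0, 1.0, L)   # new, finer knot locations
    trajectory = CubicSpline(t_old, theta_K, bc_type="natural")
    return np.clip(trajectory(t_new), 0.0, 1.0)  # keep parameters in [0, 1]

# Example: inserting a knot between every two consecutive knots (K -> 2K-1).
theta_4 = np.array([0.5, 0.7, 0.4, 0.5])
theta_7 = evolve_policy_parametrization(theta_4, 2 * len(theta_4) - 1)
```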

Algorithm 1 Adjusting the spline policy parametrization from K to L knots (\(L > K\))

Fig. 1 An example of an evolving policy parametrization based on spline representation of the policy. The set of spline knots is the policy parametrization. The spline values at the knots are the actual policy parameter values. The parametrization starts from 4 knots and evolves up to 32 knots, thus gradually increasing the resolution of the policy

Fig. 2 Reinforcement learning process using evolving policy parametrization. The black trajectory is the unknown global optimum which the reinforcement learning algorithm is trying to approximate. The policy is represented as a trajectory (in green) and is encoded using a spline. The policy evolution is shown by changing the color from dark green for the older policies to bright green for the newer ones. The idea is to gradually evolve the policy by increasing the number of knots of the spline representation and thus gradually increase the representational power of the policy parametrization. The process is done dynamically while the reinforcement learning algorithm is running (Color figure online)

3.2 Integrating the evolving policy parametrization into RL

The proposed technique for evolving the policy parametrization can be used with any policy-based RL algorithm. In this paper, we use the state-of-the-art Expectation Maximization-based RL algorithm PoWER (Kober and Peters 2011), due to its fast convergence and a low number of parameters that need tuning. This makes the algorithm appropriate for application directly on the real robot, where it is important to minimize the number of trials, and therefore, the danger of damaging the robot. To further speed up the learning process, we apply the proposed evolving policy parametrization which adaptively changes the resolution of the policy on the fly during the learning process.

In order to minimize the computational time, we evolve the policy parametrization only on those past rollouts which get selected by the importance sampling technique used by the PoWER algorithm. This way, it is not necessary to convert all previous rollouts to the latest policy parametrization. This effectively reduces the computational complexity from \(O(N^2)\) to only \(O(\sigma N)\), where N is the number of rollouts, and \(\sigma \) is the number of importance sampled rollouts at each RL iteration (\(\sigma \ll N\)). Usually, \(\sigma \) is a constant number with a value less than 10, which makes the complexity equivalent to O(N), and allows fast execution of the proposed approach for real-time applications.
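As an illustration of how these pieces could fit together, the sketch below combines the re-encoding step with a simplified, reward-weighted form of the PoWER update applied only to the importance-sampled rollouts; the variance-weighted terms of the full PoWER update are omitted, and the function names are ours (the `evolve` argument can be, e.g., the `evolve_policy_parametrization` sketch above).

```python
import numpy as np

def power_iteration(theta, rollouts, num_knots, evolve, sigma=5):
    """One simplified PoWER-style iteration with evolving parametrization.

    theta:    current policy parameters at the current resolution.
    rollouts: list of (theta_i, reward_i) pairs collected so far, possibly
              encoded at older, coarser resolutions.
    evolve:   function re-encoding a parameter vector with num_knots knots.
    """
    # Importance sampling: keep only the sigma best rollouts seen so far.
    best = sorted(rollouts, key=lambda r: r[1], reverse=True)[:sigma]

    # Re-encode only these sigma rollouts at the current resolution, rather
    # than converting the whole history (O(sigma*N) instead of O(N^2)).
    best = [(evolve(th, num_knots), rw) for th, rw in best]

    # Reward-weighted average of the exploration offsets around theta.
    numerator = sum(rw * (th - theta) for th, rw in best)
    denominator = sum(rw for _, rw in best) + 1e-12
    return theta + numerator / denominator
```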

3.3 Simulation experiment 1: function approximation with evolving spline representation

In order to evaluate the proposed reinforcement learning with evolving policy parametrization, we first conduct a simulation experiment.Footnote 1 The goal is to compare the proposed method with a conventional fixed policy parametrization method that uses the same reinforcement learning algorithm as a baseline. The following synthetic function \(\tilde{\tau }\), which is unknown to the learning algorithm, is used as the target for the optimization process.

$$\begin{aligned} \tilde{\tau }(t) = 0.5 + 0.2 \sin (10t) + 0.07 \sin (20t) + 0.04 \sin (30t) + 0.04 \sin (50t), \end{aligned}$$
(1)

In (1), \(\tilde{\tau }\) has domain \(t \in [0, 1]\) and range \(\tilde{\tau }(t) \in [0, 1]\). The learning algorithm tries to approximate \(\tilde{\tau }\) by minimizing the difference between the policy-generated trajectory and this target trajectory.

The reward function used for the simulated experiment is defined as follows:

$$\begin{aligned} R(\tau ) = e^{- \int _{0}^{1}{[\tau (t)-\tilde{\tau }(t)]^2 dt}}, \end{aligned}$$
(2)

where \(R(\tau )\) is the return of a rollout (trajectory) \(\tau \).
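For reference, the target (1) and the return (2) can be evaluated numerically as in the sketch below; the grid size and the mean-based approximation of the integral are our choices, not specified in the text.

```python
import numpy as np

def tilde_tau(t):
    """Synthetic target trajectory of Eq. (1), unknown to the learner."""
    return (0.5 + 0.2 * np.sin(10 * t) + 0.07 * np.sin(20 * t)
            + 0.04 * np.sin(30 * t) + 0.04 * np.sin(50 * t))

def rollout_return(tau_values, t):
    """Return R(tau) of Eq. (2); the integral over [0, 1] is approximated by
    the mean squared error on the uniform time grid t."""
    return float(np.exp(-np.mean((tau_values - tilde_tau(t)) ** 2)))

# Sanity check: a perfect policy achieves the maximum return of 1.
t = np.linspace(0.0, 1.0, 200)
assert abs(rollout_return(tilde_tau(t), t) - 1.0) < 1e-12
```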

Fig. 3 Simulation Experiment 1: Comparison of the policy outputs from the RL algorithm. a With fixed policy parametrization (30-knot spline), b with evolving policy parametrization (from 4-knot to 30-knot spline). In black, the global optimum, unknown to the algorithm, which it is trying to approximate. In green, all the rollouts performed during the learning process. In red, the current locally-optimal policy discovered by each RL algorithm (Color figure online)

Fig. 4 Simulation Experiment 1: Comparison of the convergence of the RL algorithm with fixed policy parametrization (30-knot spline) versus evolving policy parametrization (from 4-knot to 30-knot spline). The results are averaged over 20 runs of each of the two algorithms in simulation. The standard deviation is indicated with error bars. In addition to faster convergence and higher achieved rewards, the evolving policy parametrization also achieves lower variance compared to the fixed policy parametrization

Fig. 5 Simulation Experiment 2: Comparison of the policy outputs from the RL algorithm. a With fixed policy parametrization (20-knot spline), b with evolving policy parametrization (from 4-knot to 20-knot spline). Blue bars indicate the obstacles. Green lines represent all the rollouts performed during the learning process. Red lines represent the current locally-optimal policy discovered by each RL algorithm (Color figure online)

Fig. 6 Simulation Experiment 2: Comparison of the convergence of the RL algorithm with fixed policy parametrization (20-knot spline) versus evolving policy parametrization (from 4-knot to 20-knot spline). The results are averaged over 20 runs of each of the two algorithms in simulation. The standard deviation is indicated with error bars. In addition to faster convergence and higher achieved rewards, the evolving policy parametrization also achieves lower variance compared to the fixed policy parametrization

Figure 3 shows a comparison of the policy output produced by the proposed evolving policy parametrization method with the output from the conventional fixed policy parametrization method. Due to the lower policy-space dimensionality at the beginning, the evolving policy parametrization approaches the shape of the globally-optimal trajectory much faster. The fixed policy parametrization suffers from inefficient exploration due to the high dimensionality, as well as from overfitting problems, as seen by the high-frequency oscillations of the discovered policies.

Figure 4 shows that the convergence properties of the proposed method are significantly better than the conventional approach, in terms of faster convergence, higher achieved rewards and lower variance.

3.4 Simulation experiment 2: trajectory planning for obstacle avoidance

To further evaluate the proposed RL algorithm with evolving policy parametrization, we have conducted a second, more challenging simulation experiment. In this case, the goal is to perform trajectory planning for obstacle avoidance in 2D space. The simulated environment is shown in Fig. 5. The starting position is in the bottom-left corner with coordinates (0,0), and the goal position is in the top-right corner with coordinates (1,1). There are 6 obstacles (marked in blue) arranged in a way that creates three narrow openings with progressively smaller sizes, in order to produce a challenging motion planning problem. The same two methods are tested and compared: (i) the evolving policy parametrization method, and (ii) the conventional fixed policy parametrization method. However, this time the reward function does not have the same smoothness properties as in the previous simulation experiment presented in Sect. 3.3, because whenever a trajectory collides with an obstacle, it is terminated at that instant. This introduces discontinuities in the reward landscape and is more challenging for the learning algorithm in both cases. Furthermore, the reward function is defined based on the distance from the last reached position before arriving at the goal position. This, again, is more challenging as it introduces multiple local optima in the reward landscape which tend to trap the learning algorithm and make it harder to reach the global optimum.
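A minimal sketch of such a return computation is given below; the obstacle geometry, the termination check, and the exponential shaping are illustrative assumptions made for clarity, since only the qualitative description above is given in the text.

```python
import numpy as np

def obstacle_return(waypoints, obstacles, goal=(1.0, 1.0)):
    """Terminate the trajectory at the first collision and reward proximity
    of the last reached point to the goal.

    waypoints: (N, 2) array of points along the planned 2D trajectory.
    obstacles: list of axis-aligned rectangles (xmin, ymin, xmax, ymax).
    """
    last = waypoints[0]
    for point in waypoints:
        x, y = point
        if any(xmin <= x <= xmax and ymin <= y <= ymax
               for (xmin, ymin, xmax, ymax) in obstacles):
            break                              # collision: stop the rollout
        last = point
    distance_to_goal = np.linalg.norm(np.asarray(goal) - last)
    return float(np.exp(-distance_to_goal))    # closer to the goal = higher
```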

Fig. 7 Outline of the proposed approach for bipedal walking energy consumption minimization, showing details about each of the three components: reinforcement learning, walking generation, and real-world rollout execution. Note that all the components are run in real-time

Despite these challenges, we show that the proposed evolving policy parametrization method consistently outperforms the conventional fixed policy parametrization method. Figure 5 compares the policy output produced by the proposed evolving policy parametrization method with the output from the conventional fixed policy parametrization method. Due to the lower policy-space dimensionality at the beginning, the evolving policy parametrization is able to explore the 2D space more quickly and to navigate around the 6 obstacles in a much smoother way. In comparison, the fixed policy parametrization struggles to go through the second and third openings because of the difficulty of exploring the high-dimensional policy space. Moreover, it suffers from overfitting problems which introduce undesired jitter in the generated trajectories. Finally, the convergence properties of the two methods are compared in Fig. 6, which again confirms that the proposed method performs significantly better than the conventional approach, in terms of faster convergence, higher achieved rewards, and better quality solutions. This makes the proposed method particularly useful for real-world trajectory-planning scenarios, as shown in the following sections on a bipedal walking robot.

4 Bipedal walking energy reduction

For a real-world evaluation of the proposed approach, we tackle the problem of bipedal walking energy minimization. The proposed RL method is used to learn a vertical trajectory for the CoM of the robot such that the potential elastic energy exchange is fully utilized during walking, in order to minimize the energy consumption. A high-level outline of the real-world experiment is shown in Fig. 7.

For the reinforcement learning component, an important difference from the simulated experiments is that here the RL policy (i.e., the vertical CoM trajectory) needs to be cyclic in time. This is necessary because walking motion must be executed periodically over many cycles. A single walking cycle includes a single support phase in which either the left foot or right foot is in swing mode. This phase is followed by a double support phase where both feet are in the stance mode. Subsequently, single support phases are swapped between left and right feet to generate continuous walking motion. In particular, continuous walking was important in our case for the purpose of assessing energy consumption. Duration values for the walking phases, as well as initialization periods, are provided in Table 1.

Figure 8 illustrates the process of creating a time-cyclic policy out of a single spline in which one cycle is contained in the time interval [0, 1]. The red line represents the input policy in the form of a spline over one cycle interval; the values at the spline knots are obtained from the policy parametrization values. The green line represents the time-cyclic policy, whose spline knots are copied from the policy values over one cycle interval. Since the time-cyclic policy represents the vertical CoM trajectory in bipedal walking, it guarantees that both the position and the velocity are continuous and differentiable.
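A minimal sketch of how such a time-cyclic policy could be constructed from the one-cycle spline is given below, using a periodic cubic spline so that position and velocity stay continuous at the cycle boundaries; the implementation details and names are ours, not taken from the paper.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def make_cyclic_policy(theta):
    """Build a time-cyclic CoM-height policy from one-cycle spline values.

    theta: policy parameters (spline knot values) over one cycle, t in [0, 1].
    Returns a function of time that repeats the cycle indefinitely.
    """
    knots = np.linspace(0.0, 1.0, len(theta))
    values = np.array(theta, dtype=float)
    values[-1] = values[0]                       # close the cycle
    one_cycle = CubicSpline(knots, values, bc_type="periodic")
    return lambda t: one_cycle(np.mod(t, 1.0))   # wrap time into [0, 1)

# Example: evaluate the cyclic CoM-height trajectory over 4 walking cycles.
policy = make_cyclic_policy([0.5, 0.6, 0.45, 0.55, 0.5])
com_height = policy(np.linspace(0.0, 4.0, 2000))
```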

Table 1 Sequence of walking phases
Fig. 8 Illustration of the process for creating a time-cyclic policy out of a single spline. One time cycle is contained in the interval [0, 1]. In red, the input policy in the form of a spline (red line), where the values at the spline knots (red circles) come from the policy parametrization values. In green, the produced time-cyclic policy, where the green knots have values copied from the policy values at one cycle interval. The green spline is the output time-cyclic policy, which guarantees that both the position and the velocity of the CoM are continuous and differentiable functions (Color figure online)

Table 2 Basic specifications of the robot
Fig. 9 The mechanical assembly of COMAN and its joint configuration. Joints with yellow color indicate SEA units (Color figure online)

4.1 Compliant bipedal robot COMAN

In order to explore compliant humanoid characteristics, we developed a bipedal robot at the Italian Institute of Technology, as a part of the European AMARSi project. Table 2 summarizes its mechanical specifications. The robot has a total of 15 active DoFs (Degree of Freedom); 6 DoFs in each leg and 3 DoFs at the waist to be able to obtain greater motion flexibility. Each active joint incorporates three position sensors (two absolute and one relative encoders) and one torque sensor. The robot is also equipped with two 6-axis Force/Torque sensors at the ankles and five single-axis load cells on the foot sole. In addition, it has a triaxial rate gyro sensor and an accelerometer, located at the pelvis. In its electronic hardware structure, the main controller is an Intel Core 2 Duo 1.5 GHz dual processor with 3.0 GB RAM, running on a 32-bit GNU/Linux operating system that includes a real-time Xenomai extension. Data communication is performed via a real-time Ethernet protocol called RTnet. Figure 9 displays the actual robot and its joint configuration.

In the first prototype, only the pitch-axis ankle and knee joints are equipped with passive compliant elements, i.e., springs (see Fig. 9, frames with yellow color). For the compliant actuation system in our bipedal robot, the main objectives are to satisfy dimensional and weight requirements while achieving high rotary stiffness within a compact structure. Considering these requirements as well as the previously discussed issues, a series elastic actuator (SEA) module is a very suitable candidate, and it is currently implemented in our robot (Ugurlu et al. 2011). The rotary stiffness of these modules was set to an approximate value of 156 [Nm/rad] to maximize the walking efficiency while providing sufficient bandwidth for joint position tracking.
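To make the role of this stiffness value concrete, the sketch below computes the torque transmitted through one SEA spring and the elastic energy it stores at a given deflection, using the standard linear-spring relations; it is an illustration with names of our choosing, not code from the robot's controller.

```python
STIFFNESS = 156.0  # approximate rotary stiffness of the SEA modules [Nm/rad]

def sea_spring_state(motor_angle, link_angle, k=STIFFNESS):
    """Spring torque and stored elastic energy from the two encoder readings.

    motor_angle, link_angle: motor-side and link-side joint angles [rad].
    """
    deflection = motor_angle - link_angle   # spring deflection [rad]
    torque = k * deflection                 # tau = k * delta
    energy = 0.5 * k * deflection ** 2      # E = 0.5 * k * delta^2
    return torque, energy
```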

4.2 Evaluation of walking energy consumption

There are many ways in which energy can be measured. One possible approach is to estimate the mechanical energy from motor torque measurements and angular velocities. However, the problem with this approach is that it incorrectly includes the work done by gravity, and it can only indirectly infer the actual electric power used for walking. Furthermore, electrical energy is still consumed by the motors even when the mechanical energy is zero, e.g., when the robot is merely standing.

We propose what we consider the most direct approach: to measure the electrical energy used by all the motors of the robot, which allows us to explicitly measure the value that we are trying to minimize. We use the formula \(P=IU\), linking the electric power P to the electric current I and the voltage U, and we integrate over time to calculate the consumed electric energy in Joules. The COMAN robot is equipped with both current and voltage sensing units at each motor, so that we can accurately measure these values. Figure 10 shows the accumulated consumed electric energy values for the motor of each individual joint of COMAN, calculated as:

$$\begin{aligned} E_j(t_1, t_2) = \int _{t_1}^{t_2}I_j(t)U_j(t)\,dt , \end{aligned}$$
(3)

where j is a selected joint for which the energy consumption is calculated, and \([t_1, t_2]\) is the time interval.

Fig. 10 Electric energy consumption of each leg of COMAN during a 4-cycle walk. Alternating between left and right foot support redistributes the weight on different joint motors and causes differences in the left-right energy consumption. Hip roll joints consume the highest energy because they alone support the whole leg weight in the lateral plane while the leg is swinging. a Left leg, b right leg

To evaluate the whole walking rollout, we define the energy consumption metric of a given rollout \(\tau \) to be the average electric energy consumed per walking cycle, and estimate it using the formula:

$$\begin{aligned} E(\tau ) = \frac{1}{c} \sum _{j \in J}{ E_j(t_1, t_2) } , \end{aligned}$$
(4)

where J is the set of joints in the sagittal plane (hip, knee, and ankle pitch of both legs, 6 in total) whose energy consumption we try to minimize, and c is the number of walking cycles contained in the rollout.

In order to reduce the noise effects on the measurement, we make the robot walk for 16 s and collect the electric current and voltage measurements of the \(c=4\) consecutive walk cycles (4 repetitions of phases 7 to 10 in Table 1), which contain a total of 8 steps. Therefore, the value of \(t_1\) is the start of phase 7, and the time \(t_2\) is the end of phase 10 in the fourth cycle. Afterward, we average the energy consumption and employ this value as the estimate of the electric energy used for this walking rollout.

In this work, the main focus is the exploitation of passive compliance for energy efficiency, which may be achieved with the help of springs. Therefore, we use the sum of all electric energy consumed by the motors controlling the motion in the sagittal plane, i.e., the hip, knee, and ankle pitch joints of both legs, in our evaluation metric. Even though the hip pitch joints do not include series elasticity, they contribute significantly to the vertical CoM trajectory, as they are dominant in the sagittal plane together with the ankle pitch and knee joints; therefore, they were included in the metric.

Finally, we define the return of a rollout \(\tau \) as:

$$\begin{aligned} R(\tau ) = e^{-k E(\tau )}, \end{aligned}$$
(5)

where k is a scaling constant. The lower the energy consumed, the higher the reward is.
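The sketch below shows how the energy metric (3)-(4) and the return (5) could be computed from logged current and voltage samples; the joint names, the rectangle-rule integration, and the value of k are illustrative assumptions.

```python
import numpy as np

# Sagittal-plane joints used in the metric (names are illustrative).
SAGITTAL_JOINTS = ["l_hip_pitch", "l_knee_pitch", "l_ankle_pitch",
                   "r_hip_pitch", "r_knee_pitch", "r_ankle_pitch"]

def joint_energy(current, voltage, dt):
    """Electric energy of one motor over [t1, t2] as in Eq. (3), using a
    rectangle-rule approximation of the integral of I(t) * U(t)."""
    return float(np.sum(np.asarray(current) * np.asarray(voltage)) * dt)

def walking_return(log, dt, num_cycles=4, k=1.0):
    """Average electric energy per cycle, Eq. (4), and the return, Eq. (5).

    log: dict mapping joint name -> (current samples, voltage samples)
         recorded over the evaluated walking cycles.
    """
    total = sum(joint_energy(*log[j], dt) for j in SAGITTAL_JOINTS)
    energy_per_cycle = total / num_cycles
    return energy_per_cycle, float(np.exp(-k * energy_per_cycle))
```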

Fig. 11 The experimental setup, showing a snapshot of the bipedal robot COMAN during one walking rollout execution

5 Real-world experiments

Based on the results of the simulation experiments, the proposed evolving policy parametrization method is chosen for the real-world walking experiment, due to its favorable characteristics for real-time applications. The experimental setup is shown in Fig. 11. The total distance traveled by the robot during our experiments is around 0.5 km. In the evaluation of the energy consumption, we did not include the traveled distance, as the speed of walking was the same for all rollouts because the stride length was fixed.

Figure 12 shows the convergence results from the walking experiments. The figure shows the convergence of the consumed energy over time during the reinforcement learning. Energy measurements are normalized with respect to the maximum possible energy consumption. Each rollout corresponds to a walking experiment that was executed. For each rollout, the average energy consumed per cycle (averaged over 8 walking steps, i.e., 4 full walk cycles) is shown. At rollout number 126, the lowest energy consumption was achieved, which is 18% lower than the initial energy consumption.

Fig. 12 Results from the real-world minimization of the consumed energy for walking

Fig. 13 The optimal policy (in red) discovered by the reinforcement learning, shown among all 180 tried CoM trajectories (in green) which were executed on the real robot (Color figure online)

Figure 13 visualizes the optimal policy discovered by the RL algorithm, as well as all 180 intermediate rollouts that were performed. Although the single and double support phase periods were determined in advance, the RL algorithm discovered the instant at which the heel strikes the ground (shown with a dotted vertical line) and adjusted the trajectory so that the CoM height bounces upward at that exact moment. Note that the CoM height trajectory is normalized by considering the maximum and minimum allowable values imposed by the kinematic structure of the robot. All trajectories have been made cyclic in time so that walking can be executed continuously over many cycles; see Fig. 14.

ZMP response measurements with respect to the inertial frame are displayed in Fig. 15 for 7 consecutive steps. During single support phases, the support polygon is the supporting foot area, which is illustrated with rectangles. When the robot is in a double support phase, the area swept between the two feet becomes the support polygon. We illustrate this support polygon only once, with a dashed cyan area, where the robot switches from the first step to the second step. In addition, green areas stand for transition phases in which the robot motion is initiated from a stationary position or vice versa. Steps with odd numbers indicate the right leg's single support phases, whereas even numbers stand for the left leg's single support phases. Based on this result, it can be seen that the ZMP response is always within the support polygon boundaries. As a result, we obtained dynamically equilibrated and feasible walking cycles throughout the experimentation period. That being the case, we were able to focus solely on the energy minimization problem, without worrying about auxiliary issues such as dynamic balance.
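As a small illustration of this balance criterion (our sketch, not the actual verification code), a measured ZMP sample can be checked against the active support polygon as follows:

```python
import numpy as np

def inside_convex_polygon(point, vertices):
    """True if `point` lies inside the convex support polygon given by its
    vertices in counter-clockwise order (e.g. the corners of the supporting
    foot, or of the area swept between the two feet in double support)."""
    vertices = np.asarray(vertices, dtype=float)
    px, py = float(point[0]), float(point[1])
    for (ax, ay), (bx, by) in zip(vertices, np.roll(vertices, -1, axis=0)):
        # 2D cross product of edge (a->b) with (a->p); negative means the
        # point is on the right-hand side of the edge, i.e. outside.
        if (bx - ax) * (py - ay) - (by - ay) * (px - ax) < 0.0:
            return False
    return True

# A walking rollout is dynamically balanced if every measured ZMP sample
# lies inside the support polygon that is active at that instant.
```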

As previously stated, each compliant joint includes separate encoders to measure both the link-side (after the spring) and motor-side angles. This feature enables us to record spring deflection variations reliably. To this end, the right leg's knee and ankle joint deflections are illustrated in Figs. 16 and 17, respectively. In these figures, solid purple lines show the deflection variations while the CoM height is varied by the proposed algorithm. Solid green lines indicate deflection values when the CoM height is fixed. Analogous trends are observed for the left leg as well and are therefore not plotted.

Fig. 14 A sequence of video snapshots from the real-world experiment with the lower body of the COMAN robot

6 Discussion of results

6.1 Discussions on evolving policy parametrization

A major advantage demonstrated by the proposed approach is the low variance of the generated policies. The lower exploratory variance combined with the faster convergence is the key factor for achieving higher rewards than the fixed parametrization.

With respect to the learning, the focus of the paper is not on the encoding scheme (splines), but on the evolving policy parametrization. Spline-based techniques have well-known limitations, such as providing a non-autonomous (time-based) control policy, discarding variability and synergy information in the representation, and having difficulty coping with unforeseen perturbations (Schaal et al. 2003; Peters and Schaal 2008b). Despite these limitations, splines provided us with a simple encoding scheme to be used as a first step in studying the possibility of dynamically evolving the policy parametrization during learning. In addition, splines provided a straightforward way to implement a cyclic policy which spans continuously over many time cycles and is convenient for robot walking applications.

Fig. 15 Actual ZMP measurements with respect to the inertial frame. In this experiment, the robot walked 7 steps ahead. Foot positions are also indicated with rectangles

Fig. 16 Right knee spring deflection during walking. Similar deflection variations were also observed in the left knee joint (Color figure online)

Fig. 17 Right ankle spring deflection during walking. Similar deflection variations were also observed in the left ankle joint (Color figure online)

In this study, the number of knots was increased along the way in a heuristic manner for the sake of simplicity. By observing the convergence rates, it is possible to devise a systematic method for adding knots during the iterations. This additional feature deserves further investigation and is left as future work.

6.2 Discussions on bipedal walking and energy minimization

The passive compliance of our robot was previously exploited to generate periodic jumping patterns (Ugurlu et al. 2014). In that study, the base resonance frequency of the overall system was identified to be within the \(0.925 \sim 1.04\) Hz frequency band. When the robot is vertically excited in close proximity to the base resonance frequency, the joint deflections are expected to be maximized. This enables us to maximize the elastic potential energy stored in the springs and, as a result, to obtain walking cycles with lower energy demands.

Throughout the learning process, the RL algorithm also produced vertical motions with relatively higher frequencies, as seen in Fig. 13, resulting in poor scores with respect to the energy minimization goal. The vertical CoM movement which was eventually learned by the RL algorithm produced cyclic motions with a frequency within the \(0.925 \sim 1.04\) Hz band, approximately 1.0 Hz. This can be verified in Figs. 16 and 17. Spring deflections are measured to be about \(-11^{\circ }\), which is 97% of the maximum allowableFootnote 2 values. Therefore, the algorithm achieved the bipedal walking energy minimization goal, as it successfully found the optimal vertical CoM trajectory.
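As a rough cross-check (our arithmetic, using the approximate 156 Nm/rad stiffness reported in Sect. 4.1), the elastic energy stored in one spring at this deflection is

$$\begin{aligned} E_{\mathrm {spring}} = \frac{1}{2} k \, \Delta \theta ^2 \approx \frac{1}{2} \cdot 156 \cdot \left( \frac{11\pi }{180}\right) ^2 \approx 2.9 \ \mathrm {J}, \end{aligned}$$

which indicates the order of magnitude of elastic energy that each compliant joint can store and release per loading.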

The end result of the energy minimization is computed to be 18%, which may be regarded as a significant value when considering the real-time operation duration of bipedal robots. The authors would like to highlight the fact that this paper reports the first experimental results of a bipedal walking energy minimization task achieved on a fully actuated 3D robot with spring-supported passively compliant joints. Furthermore, the reduction allows us to operate COMAN in real-time for approximately 4.3 more hours while using the Li-Ion on-board batteries. Due to this fact, the robot requires less frequent battery recharging and becomes more environment-friendly by effectively using its limited power source.

We would like to highlight the fact that the dynamic balance is guaranteed by the ZMP-based motion generator, as it outputs walking trajectories for a given set of feasibly designated ZMP inputs, regardless of the CoM height variation. In other words, the vertical CoM trajectory is given by the RL algorithm beforehand and utilized in the ZMP-based walking generator to induce dynamically balanced horizontal CoM trajectories. Therefore, we may focus on the bipedal walking energy minimization task without any concern related to the dynamic balance issue.

In the current configuration, the spring deflections were already maximized, as illustrated in Figs. 16 and 17, thanks to the RL algorithm. Therefore, the 18% energy reduction appears to be the direct consequence of maximizing the spring deflections. The amount of energy reduction may be further improved if the springs in the joints are replaced with softer counterparts. In doing so, the elastic energy storage can be increased; however, the robot may suffer from undesired vertical oscillations. Therefore, there is a trade-off between energy minimization and dynamic balance when determining the spring stiffness profile. We handled this problem throughout the design process by empirically trying various springs with different stiffness coefficients.

Variable stiffness actuators may remedy the stiffness adjustment problem of SEAs through the active regulation of the passive compliance in real-time (Jafari et al. 2013). Variable stiffness regulation plays an important role in human walking; humans actively change the joint stiffness to explore optimal walking patterns (Geyer et al. 2006; Hu et al. 2014). That being said, these actuators are still large-sized and may not be applicable to power humanoids in their current form. Therefore, learning variable stiffness for the legged robot control will be investigated once the necessary hardware improvements are introduced.

Due to hardware limitations, the current version of COMAN has passive compliance only in the pitch-axis knee and ankle joints. Therefore, the overall energy reduction is provided solely by these 4 joints, whereas the remaining 8 leg joints (roll-axis joints, yaw-axis joints, and pitch-axis hip joints) could not contribute to this task due to the lack of passive compliance. Currently, our design engineering team is working on the next-generation COMAN bipedal robot, which will have passive compliance in all the joints. In principle, the proposed method may perform even better when conducting walking motion on a robot with passive compliance in all joints.

In this work, the idea of generating an efficient walking pattern through the use of potential energy management has its roots in studies in biomechanics (Ishikawa et al. 2005; Geyer et al. 2006). Therefore, we used an abstracted model for the humanoid so as to fully focus our attention on exploiting passive compliance for energy efficiency. While useful in their own right, abstracted models may have limitations in describing the complete robot behavior. With the advent of centroidal dynamics (Orin et al. 2013), efficient locomotion controllers were proposed (Carpentier et al. 2016; Herzog et al. 2016). An extension of centroidal dynamics for robots with passive compliance may be investigated as future work to further improve the performance.

Energy minimization may also be achieved by altering the bipedal walking generator parameters. Since the main focus of this paper was the exploitation of passive compliance, the additional investigation of energy minimization via learning optimized bipedal walking parameters is left as future work. That being said, our research group previously applied the proposed learning algorithm to efficient quadruped gaits; for details, refer to Shen et al. (2012).

7 Concluding remarks

We proposed a reinforcement learning approach that can evolve the policy parametrization dynamically during the learning process. We showed that the gradually increasing representational power of the policy parametrization helps to find better policies faster than a fixed parametrization. We successfully applied it to a bipedal walking energy minimization task by utilizing a variable-CoM-height ZMP-based bipedal walking generator. The method achieved an 18% reduction in energy consumption by learning to use the passive compliance of the robot efficiently, which constitutes the first reported experimental walking energy minimization result on a fully actuated, passively compliant state-of-the-art humanoid robot.

As a future work, we plan to extend this work for more powerful movement representations, based on a superposition of basis motion fields (Kormushev et al. 2010). Another interesting direction for extension is towards learning of variable stiffness control, which is of particular interest in the context of energy minimization.