Definition

Helicopter flight is a highly challenging control problem. While it is possible to obtain controllers for simple maneuvers (like hovering) by traditional manual design procedures, this approach is tedious and typically requires many hours of adjustments and flight testing, even for an experienced control engineer. For complex maneuvers, such as aerobatic routines, this approach is likely infeasible. In contrast, reinforcement learning (RL) algorithms enable faster and more automated design of controllers. Model-based RL algorithms have been used successfully for autonomous helicopter flight, including hovering and forward flight, and, combined with apprenticeship learning methods, for expert-level aerobatics. In model-based RL, one first builds a model of the helicopter dynamics and specifies the task using a reward function. Then, given the model and the reward function, the RL algorithm finds a controller that maximizes the expected sum of rewards accumulated over time.

Motivation and Background

Autonomous helicopter flight represents a challenging control problem and is widely regarded as being significantly harder than control of fixed-wing aircraft (see, e.g., Leishman 2000; Seddon 1990). At the same time, helicopters provide unique capabilities such as in-place hover, vertical takeoff and landing, and low-speed maneuvering. These capabilities make helicopter control an important research problem for many practical applications.

Building autonomous flight controllers for helicopters, however, is far from trivial. When done by hand, it can require many hours of tuning by experts with extensive prior knowledge about helicopter dynamics. Meanwhile, the automated development of helicopter controllers has been a major success story for RL methods. Controllers built using RL algorithms have established state-of-the-art performance for basic flight maneuvers, such as hovering and forward flight (Bagnell and Schneider 2001; Ng et al. 2004b), and are among the only successful methods for advanced aerobatic stunts. Autonomous helicopter aerobatics has been successfully tackled using the innovation of “apprenticeship learning,” where the algorithm learns by watching a human demonstrator (Abbeel and Ng 2004). These methods have enabled autonomous helicopters to fly aerobatics as well as an expert human pilot, and often even better (Coates et al. 2008).

Developing autonomous flight controllers for helicopters is challenging for a number of reasons:

  1.

    Helicopters have unstable, high-dimensional, asymmetric, noisy, nonlinear, non-minimum phase dynamics. As a consequence, all successful helicopter flight controllers (to date) have many parameters. Controllers with 10–100 gains are not atypical. Hand engineering the right setting for each of the parameters is difficult and time consuming, especially since their effects on performance are often highly coupled through the helicopter’s complicated dynamics. Moreover, the unstable dynamics, especially in the low-speed flight regime, complicates flight testing.

  2.

    Helicopters are underactuated: their position and orientation are representable using six parameters, but they have only four control inputs. Thus helicopter control requires significant planning and making trade-offs between errors in orientation and errors in desired position.

  3.

    Helicopters have highly complex dynamics: even though we describe the helicopter as having a 12-dimensional state (position, velocity, orientation, and angular velocity), the true dynamics are significantly more complicated. To determine the precise effects of the inputs, one would have to consider the airflow in a large volume around the helicopter, as well as the parasitic coupling between the different inputs, the engine performance, and the non-rigidity of the rotor blades. Highly accurate simulators are thus difficult to create, and controllers developed in simulation must be sufficiently robust that they generalize to the real helicopter in spite of the simulator’s imperfections.

  4.

    Sensing capabilities are often poor: for small remotely controlled (RC) helicopters, sensing is limited because the onboard sensors must deal with a large amount of vibration caused by the helicopter blades rotating at about 30 Hz, as well as higher frequency noise from the engine. Although noise at these frequencies (which are well above the roughly 10 Hz at which the helicopter dynamics can be modeled reasonably) might be easily removed by low pass filtering, this introduces latency and damping effects that are detrimental to control performance. As a consequence, helicopter flight controllers have to be robust to noise and/or latency in the state estimates to work well in practice.

Typical Hardware Setup

A typical autonomous helicopter has several basic sensors on board. An inertial measurement unit (IMU) measures angular rates and linear accelerations for each of the helicopter’s three axes. A 3-axis magnetometer senses the direction of the Earth’s magnetic field, similar to a magnetic compass (Fig. 1).

Autonomous Helicopter Flight Using Reinforcement Learning, Fig. 1

(a) Stanford University’s instrumented XCell Tempest autonomous helicopter. (b) A Bergen Industrial Twin autonomous helicopter with sensors and onboard computer

Attitude-only sensing, as provided by the inertial and magnetic sensors, is insufficient for precise, stable hovering and slow-speed maneuvers. These maneuvers require that the helicopter maintain relatively tight control over its position error, and hence high-quality position sensing is needed. GPS is often used to determine helicopter position (with carrier-phase GPS units achieving sub-decimeter accuracy), but vision-based solutions have also been employed (Abbeel et al. 2007; Coates et al. 2008; Saripalli et al. 2003).

Vibration adds errors to the sensor measurements and may damage the sensors themselves; hence, significant effort may be required to mount the sensors on the airframe (Dunbabin et al. 2004). Provided there is no aliasing, sensor errors added by vibration can be removed by using a digital filter on the measurements (though, again, one must be careful to avoid adding too much latency).

Sensor data from the aircraft sensors is used to estimate the state of the helicopter for use by the control algorithm. This is usually done with an extended Kalman filter (EKF). A unimodal distribution (as computed by the EKF) suffices to represent the uncertainty in the state estimates, and it is common practice to use the mode of the distribution as the state estimate for feedback control. In general the accuracy obtained with this method is sufficiently high that one can treat the state as fully observed.
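As a rough illustration of that pipeline, the sketch below shows the generic EKF predict/update cycle that such a state estimator is built around; the dynamics function f, measurement function h, their Jacobians, and the noise covariances are placeholders rather than the models used on any particular aircraft.

```python
import numpy as np

# A minimal sketch of one EKF cycle (placeholder models f, h and Jacobians;
# a real helicopter filter would use IMU-driven kinematics for f and
# GPS/magnetometer/vision models for h). The returned mean x is the mode of
# the Gaussian estimate that is then treated as the "fully observed" state.
def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    # Predict: propagate the state estimate and covariance through the dynamics.
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the measurement z.
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```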

Most autonomous helicopters have an onboard computer that runs the EKF and the control algorithm (Gavrilets et al. 2002a; La Civita et al. 2006; Ng et al. 2004a). However, it is also possible to use ground-based computers by sending sensor data wirelessly to the ground and then transmitting control signals back to the helicopter through the pilot’s RC transmitter (Abbeel et al. 2007; Coates et al. 2008).

Helicopter State and Controls

The helicopter state \(s\) is defined by its position (\(p_{x},p_{y},p_{z}\)), orientation (which could be expressed using a unit quaternion \(q\)), velocity (\(v_{x},v_{y},v_{z}\)), and angular velocity (\(\omega _{x},\omega _{y},\omega _{z}\)).

The helicopter is controlled via a 4-dimensional action space:

  1.

    \(u_{1}\) and \(u_{2}\): The lateral (left-right) and longitudinal (front-back) cyclic pitch controls (together referred to as the “cyclic” controls) cause the helicopter to roll left or right and pitch forward or backward, respectively.

  2.

    \(u_{3}\): The tail rotor pitch control affects tail rotor thrust and can be used to yaw (turn) the helicopter about its vertical axis. In analogy to airplane control, the tail rotor control is commonly referred to as “rudder.”

  3.

    \(u_{4}\): The collective pitch control (often referred to simply as “collective”) increases and decreases the pitch of the main rotor blades, thus increasing or decreasing the vertical thrust produced as the blades sweep through the air.

By using the cyclic and rudder controls, the pilot can rotate the helicopter into any orientation. This allows the pilot to direct the thrust of the main rotor in any particular direction, and thus fly in any direction, by rotating the helicopter appropriately.

Helicopter Flight as an RL Problem

Formulation

An RL problem can be described by a tuple \((S,\mathcal{A},T,H,s(0),R)\), which is referred to as a Markov decision process (MDP). Here \(S\) is the set of states; \(\mathcal{A}\) is the set of actions or inputs; \(T\) is the dynamics model, which is a set of probability distributions \(\{P_{su}^{t}\}\) (\(P_{su}^{t}(s^{{\prime}}\vert s,u)\) is the probability of being in state \(s^{{\prime}}\) at time \(t + 1\), given that the state and action at time \(t\) are \(s\) and \(u\)); \(H\) is the horizon or number of time steps of interest; \(s(0) \in S\) is the initial state; and \(R : S \times \mathcal{A}\rightarrow \mathbb{R}\) is the reward function.

A policy \(\pi = (\mu _{0},\mu _{1},\ldots,\mu _{H})\) is a tuple of mappings from states \(S\) to actions \(\mathcal{A}\), one mapping for each time \(t = 0,\ldots,H\). The expected sum of rewards when acting according to a policy \(\pi\) is given by \(U(\pi ) = \mathrm{E}[\sum _{t\,=\,0}^{H}R(s(t),u(t))\vert \pi ]\). The optimal policy \(\pi ^{{\ast}}\) for an MDP \((S,\mathcal{A},T,H,s(0),R)\) is the policy that maximizes the expected sum of rewards. In particular, the optimal policy is given by: \(\pi ^{{\ast}} =\arg \max _{\pi }U(\pi )\).
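To make the objective concrete, the following sketch estimates \(U(\pi )\) for a given policy by Monte Carlo rollouts of a simulator; the step, reward, and policy functions are placeholders standing in for the learned dynamics model, the chosen reward, and the policy being evaluated.

```python
import numpy as np

# A minimal sketch of estimating U(pi) = E[sum_{t=0}^{H} R(s(t), u(t)) | pi]
# by averaging rollouts of a (possibly stochastic) simulator.
def estimate_return(policy, step, reward, s0, H, n_rollouts=100, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = np.array(s0, dtype=float), 0.0
        for t in range(H + 1):
            u = policy(s, t)            # action from mu_t
            ret += reward(s, u)         # accumulate R(s(t), u(t))
            s = step(s, u, rng)         # sample s(t+1) from the dynamics model
        total += ret
    return total / n_rollouts
```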

The common approach to finding a good policy for autonomous helicopter flight proceeds in two steps: First one collects data from manual helicopter flights to build a model. (One could also build a helicopter model by directly measuring physical parameters such as mass, rotor span, etc. However, even when this approach is pursued, one often resorts to collecting flight data to complete the model.) Then one solves the MDP comprised of the model and some chosen reward function. Although the controller obtained, in principle, is only optimal for the learned simulator model, it has been shown in various settings that optimal controllers perform well even when the model has some inaccuracies (see, e.g., Anderson and Moore 1989).

Modeling

One way to create a helicopter model is to use direct knowledge of aerodynamics to derive an explicit mathematical model. This model depends on a number of parameters that are particular to the helicopter being flown. Many of the parameters may be measured directly (e.g., mass, rotational inertia), while others must be estimated from flight experiments. This approach has been used successfully on several systems (see, e.g., Gavrilets et al. 2001, 2002b; La Civita 2003). However, substantial expert aerodynamics knowledge is required for this modeling approach. Moreover, these models tend to cover only a limited fraction of the flight envelope.

Alternatively, one can learn a model of the dynamics directly from flight data, with only limited a priori knowledge of the helicopter’s dynamics. Data is usually collected from a series of manually controlled flights. These flights involve the human pilot sweeping the control sticks back and forth at varying frequencies to cover as much of the flight envelope as possible, while recording the helicopter’s state and the pilot inputs at each instant.

Given a corpus of flight data, various different learning algorithms can be used to learn the underlying model of the helicopter dynamics.

If one is only interested in a single flight regime, one could learn a linear model that maps from the current state and action to the next state. Such a model can be easily estimated using linear regression. (While the methods presented here emphasize time-domain estimation, frequency-domain estimation is also possible for the special case of estimating linear models; see Tischler and Cauffman 1992.) Linear models are restricted to small flight regimes (e.g., hover or inverted hover) and do not immediately generalize to full-envelope flight. To cover a broader flight regime, nonparametric algorithms such as locally weighted linear regression have been used (Bagnell and Schneider 2001; Ng et al. 2004b). Nonparametric models that map from current state and action to next state can, in principle, cover the entire flight regime. Unfortunately, one must collect large amounts of data to obtain an accurate model, and the models are often quite slow to evaluate.
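For the simplest case above, a hedged sketch of fitting a single linear model \(s(t+1) \approx A\,s(t) + B\,u(t)\) from logged state/action data by least squares might look as follows; the data arrays stand in for recorded flight data around one flight regime.

```python
import numpy as np

# A minimal sketch of fitting a linear dynamics model s(t+1) ~ A s(t) + B u(t)
# around a single flight regime (e.g., hover) by least squares.
# states: (T+1, n) array of logged states; actions: (T, p) array of pilot inputs.
def fit_linear_model(states, actions):
    X = np.hstack([states[:-1], actions])          # regressors [s(t), u(t)]
    Y = states[1:]                                  # targets s(t+1)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)       # least-squares fit of X W = Y
    n = states.shape[1]
    A, B = W[:n].T, W[n:].T                         # recover A (n x n) and B (n x p)
    return A, B
```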

An alternative way to increase the expressiveness of the model, without resorting to nonparametric methods, is to consider a time-varying model in which the dynamics are explicitly allowed to depend on time. One can then compute simpler (say, linear) parametric models for each choice of the time parameter. This method is effective when learning a model specific to a trajectory whose dynamics are repeatable but vary as the aircraft travels along the trajectory. Since this method can also require a great deal of data (similar to nonparametric methods), in practice it is helpful to begin with a non-time-varying parametric model fit from a large amount of data and then augment it with a time-varying component that has fewer parameters (Abbeel et al. 2006; Coates et al. 2008).
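One very simple realization of this idea, assuming a fixed baseline linear model is already available, is to fit a per-time-step correction to the residuals the baseline model leaves along the repeatedly flown trajectory; the sketch below only illustrates the principle and is much cruder than the time-varying models used in the cited work.

```python
import numpy as np

# A crude sketch of augmenting a fixed linear model with a time-varying term:
# fit a per-time-step bias b(t) to the one-step prediction residuals averaged
# over several recordings of the same trajectory.
# runs_states: (K, T+1, n) states from K runs; runs_actions: (K, T, p) inputs.
def fit_time_varying_bias(runs_states, runs_actions, A, B):
    pred = runs_states[:, :-1] @ A.T + runs_actions @ B.T   # baseline predictions
    residuals = runs_states[:, 1:] - pred                   # (K, T, n) residuals
    return residuals.mean(axis=0)                           # b(t), one vector per time step

def predict(s, u, t, A, B, bias):
    return A @ s + B @ u + bias[t]
```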

One can also take advantage of symmetry in the helicopter dynamics to reduce the amount of data needed to fit a parametric model. Abbeel et al. (2006) observe that – in a coordinate frame attached to the helicopter – the helicopter dynamics are essentially the same for any orientation (or position) once the effect of gravity is removed. They learn a model that predicts (angular and linear) accelerations – except for the effects of gravity – in the helicopter frame as a function of the inputs and the (angular and linear) velocity in the helicopter frame. This leads to a lower-dimensional learning problem, which requires significantly less data. To simulate the helicopter dynamics over time, the predicted accelerations augmented with the effects of gravity are integrated over time to obtain velocity, angular rates, position, and orientation.
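A much-simplified sketch of this parameterization follows: body-frame linear and angular accelerations (with gravity removed) are predicted from the body-frame velocity, angular rate, and inputs, and gravity is added back before integrating. The linear feature map, the parameter matrix theta, and the Euler integration are illustrative simplifications, not the model of the cited papers.

```python
import numpy as np

# A simplified sketch of the gravity-free, body-frame parameterization:
# predict accelerations (gravity removed) from body-frame velocity, angular
# rate, and the four control inputs, then re-add gravity and integrate.
g_world = np.array([0.0, 0.0, -9.81])       # gravity in the world frame

def predict_accels(theta, v_body, omega_body, u):
    feats = np.concatenate([v_body, omega_body, u, [1.0]])   # simple linear features
    accels = theta @ feats                                    # 6 gravity-free accelerations
    return accels[:3], accels[3:]                             # linear, angular

def simulate_step(theta, v_body, omega_body, u, R_body_to_world, dt):
    a_lin, a_ang = predict_accels(theta, v_body, omega_body, u)
    a_lin = a_lin + R_body_to_world.T @ g_world               # add gravity in the body frame
    # Crude Euler integration of the velocities only; position and orientation
    # would be integrated analogously in a full simulator.
    return v_body + dt * a_lin, omega_body + dt * a_ang
```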

Abbeel et al. (2007) used this approach to learn a helicopter model that was later used for autonomous aerobatic helicopter flight maneuvers covering a large part of the flight envelope. Significantly less data is required to learn a model using the gravity-free parameterization compared to a parameterization that directly predicts the next state as a function of current state and actions (as was used in Bagnell and Schneider (2001) and Ng et al. (2004b)). Abbeel et al. evaluate their model by checking its simulation accuracy over longer time scales than just a one-step acceleration prediction. Such an evaluation criterion maps more directly to the reinforcement learning objective of maximizing the expected sum of rewards accumulated over time (see also Abbeel and Ng 2005b).

The models considered above are deterministic. This normally would allow us to drop the expectation when evaluating a policy according to \(\mathrm{E}\left [\sum _{t\,=\,0}^{H}R(s(t),u(t))\vert \pi \right ]\). However, it is common to add stochasticity to account for unmodeled effects. Abbeel et al. (2007) and Ng et al. (2004a) include additive process noise in their models. Bagnell and Schneider (2001) go further, learning a distribution over models. Their policy must then perform well, in expectation, for a (deterministic) model selected randomly from the distribution.

Control Problem Solution Methods

Given a model of the helicopter, we now seek a policy \(\pi\) that maximizes the expected sum of rewards \(U(\pi ) = \mathrm{E}\left [\sum _{t=0}^{H}R(s(t),u(t))\vert \pi \right ]\) achieved when acting according to the policy \(\pi\).

Policy Search

General policy search algorithms can be employed to search for optimal policies for the MDP based on the learned model. Given a policy \(\pi\), we can directly try to optimize the objective \(U(\pi )\). Unfortunately, \(U(\pi )\) is an expectation over a complicated distribution, making it impractical to evaluate exactly in general.

One solution is to approximate the expectation \(U(\pi )\) by Monte Carlo sampling: under certain boundedness assumptions, the empirical average of the sum of rewards accumulated over time will give a good estimate \(\hat{U}(\pi )\) of the expectation \(U(\pi )\). Naively applying Monte Carlo sampling to accurately compute, e.g., the local gradient from the difference in function values at nearby points requires a very large number of samples due to the stochasticity in the function evaluations.

To get around this hurdle, the PEGASUS algorithm (Ng and Jordan 2000) can be used to convert the stochastic optimization problem into a deterministic one. When policies are evaluated by averaging over \(n\) simulations, PEGASUS initially fixes \(n\) random seeds. For each policy evaluation, the same \(n\) random seeds are used, so that the simulator is now deterministic. In particular, multiple evaluations of the same policy will result in the same computed reward. A search algorithm can then be applied to the deterministic problem to find an optimum.
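A hedged sketch of this idea is given below: the random seeds are drawn once and reused for every evaluation, so \(\hat{U}(\pi )\) becomes a deterministic function of the policy and can be handed to an ordinary derivative-free or gradient-based optimizer. The simulator, reward, and policy are again placeholders.

```python
import numpy as np

# A minimal sketch of PEGASUS-style evaluation: fix n random seeds once and
# reuse them for every policy evaluation, so U_hat(pi) is a deterministic
# function of the policy parameters (placeholder simulator step, reward, policy).
def make_pegasus_objective(step, reward, s0, H, n=30, master_seed=0):
    seeds = np.random.SeedSequence(master_seed).spawn(n)     # fixed once
    def U_hat(policy):
        total = 0.0
        for seed in seeds:
            rng = np.random.default_rng(seed)                # identical noise every call
            s, ret = np.array(s0, dtype=float), 0.0
            for t in range(H + 1):
                u = policy(s, t)
                ret += reward(s, u)
                s = step(s, u, rng)
            total += ret
        return total / n
    return U_hat
```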

The PEGASUS algorithm coupled with a simple local policy search was used by Ng et al. (2004a) to develop a policy for their autonomous helicopter that successfully sustains inverted hover. Bagnell and Schneider (2001) proceed similarly, but use the “amoeba” search algorithm (Nelder and Mead 1965) for policy search.

Because of the searching involved, the policy class must generally have low dimension. Nonetheless, it is often possible to find good policies within these policy classes. The policy class of Ng et al. (2004a), for instance, is a decoupled, linear PD controller with a sparse dependence on the state variables. (For instance, the linear controller for the pitch axis is parametrized as \(u_{2} = c_{0}(p_{x} - p_{x}^{{\ast}}) + c_{1}(v_{x} - v_{x}^{{\ast}}) + c_{2}\theta\), which has just three parameters, while the entire state is nine dimensional. Here, \(p_{\cdot }\), \(v_{\cdot }\), and \(p_{\cdot }^{{\ast}}\), \(v_{\cdot }^{{\ast}}\), respectively, are the actual and desired position and velocity. \(\theta\) denotes the pitch angle.) The sparsity reduces the policy class to just nine parameters. In Bagnell and Schneider (2001), two-layer neural network structures are used with a similar sparse dependence on the state variables. Two neural networks with five parameters each are learned for the cyclic controls.
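For concreteness, the quoted pitch-axis controller can be written as a three-parameter function as in the sketch below; the variable names and sign conventions are illustrative and are not taken from the cited implementation.

```python
# A sketch of the sparse linear PD policy class described above: the pitch
# (longitudinal cyclic) command depends only on longitudinal position error,
# velocity error, and pitch angle. Names and signs are illustrative.
def pitch_cyclic(c, p_x, v_x, theta, p_x_des, v_x_des):
    c0, c1, c2 = c
    return c0 * (p_x - p_x_des) + c1 * (v_x - v_x_des) + c2 * theta
```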

Differential Dynamic Programming

Abbeel et al. (2007) use differential dynamic programming (DDP) for the task of aerobatic trajectory following. DDP (Jacobson and Mayne 1970) works by iteratively approximating the MDP as linear quadratic regulator (LQR) problems. The LQR control problem is a special class of MDPs, for which the optimal policy can be computed efficiently. In LQR the set of states is given by \(S = \mathbb{R}^{n}\), the set of actions/inputs is given by \(\mathcal{A} = \mathbb{R}^{p}\), and the dynamics model is given by

$$\displaystyle{s(t + 1) = A(t)s(t) + B(t)u(t) + w(t),}$$

where for all \(t = 0,\ldots,H\) we have that \(A(t) \in \mathbb{R}^{n\times n}\), \(B(t) \in \mathbb{R}^{n\times p}\), and \(w(t)\) is a mean zero random variable (with finite variance). The reward for being in state \(s(t)\) and taking action \(u(t)\) is given by

$$\displaystyle{-s(t)^{\top }Q(t)s(t) - u(t)^{\top }R(t)u(t).}$$

Here \(Q(t),R(t)\) are positive semi-definite matrices which parameterize the reward function. It is well known that the optimal policy for the LQR control problem is a linear feedback controller which can be efficiently computed using dynamic programming (see, e.g., Anderson and Moore (1989), for details on linear quadratic methods).
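A minimal sketch of that dynamic-programming (backward Riccati) recursion for the finite-horizon, time-varying problem above is shown below. It assumes the \(R(t)\) matrices are positive definite so the required inverses exist; by certainty equivalence, the noise \(w(t)\) does not affect the resulting gains.

```python
import numpy as np

# A minimal sketch of the backward (Riccati) recursion for the time-varying,
# finite-horizon LQR problem above. A, B, Q, R are lists of matrices indexed
# by t = 0..H; the returned gains give the linear feedback u(t) = K[t] @ s(t).
# Assumes each R[t] is positive definite; the noise w(t) does not change the
# gains (certainty equivalence).
def lqr_gains(A, B, Q, R):
    H = len(A) - 1
    P = np.zeros_like(Q[-1])                 # value-function matrix beyond the horizon
    K = [None] * (H + 1)
    for t in range(H, -1, -1):
        S = R[t] + B[t].T @ P @ B[t]
        K[t] = -np.linalg.solve(S, B[t].T @ P @ A[t])
        P = Q[t] + A[t].T @ P @ (A[t] + B[t] @ K[t])
    return K
```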

DDP approximately solves general continuous state-space MDPs by iterating the following two steps until convergence:

  1.

    Compute a linear approximation to the nonlinear dynamics and a quadratic approximation to the reward function around the trajectory obtained when executing the current policy in simulation.

  2.

    Compute the optimal policy for the LQR problem obtained in Step 1, and set the current policy equal to the optimal policy for the LQR problem.

During the first iteration, the linearizations are performed around the target trajectory for the maneuver, since an initial policy is not available.

This method is used to perform autonomous flips, rolls, and “funnels” (high-speed sideways flight in a circle) in Abbeel et al. (2007) and autonomous autorotation (autorotation is an emergency maneuver that allows a skilled pilot to glide a helicopter to a safe landing in the event of an engine failure or tail-rotor failure) in Abbeel et al. (2008) (Fig. 2).

Autonomous Helicopter Flight Using Reinforcement Learning, Fig. 2

Snapshots of an autonomous helicopter performing in-place flips and rolls

While DDP computes a solution to the nonlinear optimization problem, it relies on the accuracy of the nonlinear model to correctly predict the trajectory that will be flown by the helicopter. This prediction is used in Step 1 above to linearize the dynamics. In practice, the helicopter will often not follow the predicted trajectory closely (due to stochasticity and modeling errors), and thus the linearization will become a highly inaccurate approximation of the nonlinear model. A common solution to this, applied by Coates et al. (2008), is to compute the DDP solution online, linearizing around a trajectory that begins at the current helicopter state. This ensures that the model is always linearized around a trajectory near the helicopter’s actual flight path.

Apprenticeship Learning and Inverse RL

In computing a policy for an MDP, simply finding a solution (using any method) that performs well in simulation may not be enough. One may need to adjust both the model and reward function based on the results of flight testing. Modeling error may result in controllers that fly perfectly in simulation but perform poorly or fail entirely in reality. Because helicopter dynamics are difficult to model exactly, this problem is fairly common. Meanwhile, a poor reward function can result in a controller that is not robust to modeling errors or unpredicted perturbations (e.g., it may use large control inputs that are unsafe in practice). If a human “expert” is available to demonstrate the maneuver, this demonstration flight can be leveraged to obtain a better model and reward function.

The reward function encodes both the trajectory that the helicopter should follow and the trade-offs between different types of errors. If the desired trajectory is infeasible (either in the nonlinear simulation or in reality), this results in a significantly more difficult control problem. Also, if the trade-offs are not specified correctly, the helicopter may be unable to compensate for significant deviations from the desired trajectory. For instance, a typical reward function for hovering implicitly specifies a trade-off between position error and orientation error (it is possible to reduce one error, but usually at the cost of increasing the other). If this trade-off is incorrectly chosen, the controller may be pushed off course by wind (if it tries too hard to keep the helicopter level) or, conversely, may tilt the helicopter to an unsafe attitude while trying to correct for a large position error.

We can use demonstrations from an expert pilot to recover both a good choice for the desired trajectory and good choices of reward weights for errors relative to this trajectory. In apprenticeship learning, we are given a set of \(N\) recorded state and control sequences, \(\{s_{k}(t),u_{k}(t)\}_{t\,=\,0}^{H}\) for \(k\, =\, 1,\ldots,N\), from demonstration flights by an expert pilot. Coates et al. (2008) note that these demonstrations may be suboptimal but are often suboptimal in different ways. They suggest that a large number of expert demonstrations may implicitly encode the optimal trajectory and propose a generative model that explains the expert demonstrations as stochastic instantiations of an “ideal” trajectory. This is the desired trajectory that the expert has in mind but is unable to demonstrate exactly. Using an expectation-maximization (EM) algorithm (Dempster et al. 1977), they infer the desired trajectory and use it as the target trajectory in their reward function.
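The generative model itself involves latent time alignment and a dynamics prior, but a much-simplified version of the underlying estimation, assuming the demonstrations are already time-aligned and differ from the ideal trajectory only by zero-mean noise with unknown per-demonstration variance, can be sketched as follows. This is an illustrative caricature, not the algorithm of Coates et al.

```python
import numpy as np

# A much-simplified caricature of inferring an "ideal" trajectory from several
# noisy, already time-aligned demonstrations: alternate between a precision-
# weighted average (the trajectory estimate) and per-demonstration noise
# variances. The real method additionally infers time warping and places a
# dynamics prior on the hidden trajectory.
def infer_ideal_trajectory(demos, iters=20):
    # demos: array of shape (N, H+1, state_dim), one row per demonstration.
    sigma2 = np.ones(demos.shape[0])
    for _ in range(iters):
        w = 1.0 / sigma2
        ideal = np.einsum('n,nts->ts', w, demos) / w.sum()    # precision-weighted mean
        sigma2 = ((demos - ideal) ** 2).mean(axis=(1, 2))     # residual variances
    return ideal
```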

A good choice of reward weights (for errors relative to the desired trajectory) can be recovered using inverse reinforcement learning (Ng and Russell 2000; Abbeel and Ng 2004). Suppose the reward function is written as a linear combination of features as follows: \(R(s,u) = c_{0}\phi _{0}(s,u) + c_{1}\phi _{1}(s,u) + \cdots\). For a single recorded demonstration, \(\{s(t),u(t)\}_{t=0}^{H}\), the pilot’s accumulated reward corresponding to each feature can be computed as \(c_{i}\phi _{i}^{{\ast}} = c_{i}\sum _{t=0}^{H}\phi _{i}(s(t),u(t))\). If the pilot outperforms the autonomous flight controller with respect to a particular feature \(\phi _{i}\), this indicates that the pilot’s own “reward function” places a higher value on that feature, and hence its weight \(c_{i}\) should be increased. Using this procedure, a reward function that makes trade-offs similar to those of a human pilot can be recovered. This method has been used to guide the choice of reward for many maneuvers during flight testing (Abbeel et al. 2007, 2008; Coates et al. 2008).
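In its simplest form, this adjustment can be sketched as below: accumulate each feature over the pilot's and the controller's trajectories and nudge each weight toward the features on which the pilot accumulates more reward. The update rule and step size are illustrative; the cited papers use more principled apprenticeship-learning updates.

```python
import numpy as np

# A simplified sketch of the feature-based reward-weight adjustment described
# above: accumulate each feature phi_i over a trajectory and increase the
# weight of features on which the expert pilot outperforms the current
# controller. This only illustrates the direction of the adjustment.
def feature_totals(traj, features):
    # traj: list of (s, u) pairs; features: list of functions phi_i(s, u).
    return np.array([sum(phi(s, u) for s, u in traj) for phi in features])

def update_weights(c, pilot_traj, controller_traj, features, step=0.1):
    gap = feature_totals(pilot_traj, features) - feature_totals(controller_traj, features)
    return c + step * gap        # raise c_i where the pilot accumulates more reward
```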

In addition to learning a better reward function from pilot demonstration, one can also use the pilot demonstration to improve the model directly and attempt to reduce modeling error. Coates et al. (2008), for instance, use errors observed in expert demonstrations to jointly infer an improved dynamics model along with the desired trajectory. Abbeel et al. (2007), however, have proposed the following alternating procedure that is broadly applicable (see also Abbeel and Ng (2005a) for details):

  1.

    Collect data from a human pilot flying the desired maneuvers with the helicopter. Learn a model from the data.

  2.

    Find a controller that works in simulation based on the current model.

  3.

    Test the controller on the helicopter. If it works, we are done. Otherwise, use the data from the test flight to learn a new (improved) model and go back to Step 2.

This procedure has similarities with model-based RL and with the common approach in control of first performing system identification and then finding a controller using the resulting model. However, the key insight from Abbeel and Ng (2005a) is that this procedure is guaranteed to converge to expert performance in a polynomial number of iterations. The authors report needing at most three iterations in practice. Importantly, unlike the \(E^{3}\) family of algorithms (Kearns and Singh 2002), this procedure does not require explicit exploration policies. One only needs to test controllers that try to fly the helicopter as well as possible (according to the current choice of dynamics model). (Indeed, the \(E^{3}\) family of algorithms (Kearns and Singh 2002) and its extensions (Kearns and Koller 1999; Brafman and Tennenholtz 2002; Kakade et al. 2003) proceed by generating “exploration” policies, which try to visit inaccurately modeled parts of the state space. Unfortunately, such exploration policies do not even try to fly the helicopter well and thus would almost invariably lead to crashes.)
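The three-step procedure above can be summarized by the schematic loop below; learn_model, find_controller, and test_on_helicopter are placeholders for the model-learning, policy-optimization, and flight-testing steps already described, not functions from any published codebase.

```python
# A schematic sketch of the alternating procedure above. The three routines
# passed in are placeholders: model learning from flight data, policy
# optimization against the current model, and an actual flight test that
# returns newly logged data plus a success flag.
def apprenticeship_loop(pilot_data, learn_model, find_controller, test_on_helicopter):
    data = list(pilot_data)                       # Step 1: expert demonstrations
    while True:
        model = learn_model(data)                 # (re)fit the dynamics model
        controller = find_controller(model)       # Step 2: optimize in simulation
        new_data, success = test_on_helicopter(controller)   # Step 3: flight test
        if success:
            return controller
        data.extend(new_data)                     # add the test-flight data and repeat
```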

The apprenticeship learning algorithms described above have been used to fly the most advanced autonomous maneuvers to date. The apprenticeship learning algorithm of Coates et al. (2008), for example, has been used to attain expert level performance on challenging aerobatic maneuvers as well as entire air shows composed of many maneuvers in rapid sequence. These maneuvers include in-place flips and rolls, tic-tocs (“tic-toc” is a maneuver where the helicopter pitches forward and backward with its nose pointed toward the sky (resembling an inverted clock pendulum)), and chaos. (“Chaos” is a maneuver where the helicopter flips in place but does so while continuously pirouetting at a high rate. Visually, the helicopter body appears to tumble chaotically while nevertheless remaining in roughly the same position.) (See Fig. 3.) These maneuvers are considered among the most challenging possible and can only be performed by advanced human pilots. In fact, Coates et al. (2008) show that their learned controller performance can even exceed the performance of the expert pilot providing the demonstrations, putting many of the maneuvers on par with professional pilots (Fig. 4).

Autonomous Helicopter Flight Using Reinforcement Learning, Fig. 3

Snapshot sequence of an autonomous helicopter flying a “chaos” maneuver using apprenticeship learning methods. Beginning at the top left and proceeding left to right and top to bottom, the helicopter performs a flip while pirouetting counterclockwise about its vertical axis (this maneuver has been demonstrated continuously for as long as 15 cycles like the one shown here)

Autonomous Helicopter Flight Using Reinforcement Learning, Fig. 4

Superimposed sequence of images of autonomous autorotation landings (from Abbeel et al. 2008)

A similar approach has been used in Abbeel et al. (2008) to perform the first successful autonomous autorotations. Their aircraft has performed more than 30 autonomous landings successfully without engine power.

Not only do apprenticeship methods achieve state-of-the-art performance, but they are among the fastest learning methods available, as they obviate the need for arduous hand tuning by engineers. Coates et al. (2008), for instance, report that entire air shows can be created from scratch with just 1 h of work. This is in stark contrast to previous approaches that may have required hours or even days of tuning for relatively simple maneuvers.

Conclusion

Helicopter control is a challenging control problem and has recently seen major successes with the application of learning algorithms. This entry has shown how each step of the control design process can be automated using machine learning algorithms for system identification and reinforcement learning algorithms for control. It has also shown how apprenticeship learning algorithms can be employed to achieve expert-level performance on challenging aerobatic maneuvers when an expert pilot can provide demonstrations. Autonomous helicopters with control systems developed using these methods are now capable of flying advanced aerobatic maneuvers (including flips, rolls, tic-tocs, chaos, and autorotation) at the level of expert human pilots.

Cross-References