1 Introduction

Robots have been widely used in industrial tasks such as assembly, grinding, and polishing. Classical programming methods teach robots to perform repetitive motions by defining key frames with a teach pendant or with off-line programming software and CAD models. However, such position control-based methods are not suitable for high-precision assembly with narrow clearance, e.g. peg-in-hole insertion, due to low tracking accuracy and environmental uncertainties. Traditional methods are based on contact state models: they recognize the contact state according to the model and then apply the corresponding control strategy to complete the assembly task. Commonly used contact state models include analytical models, such as the quasi-static model (Whitney 1982; Kim et al. 1999) and the dynamic model (Xia et al. 2006), and statistical models, such as the Support Vector Machine (SVM) (Jakovljevic et al. 2012), fuzzy classifier (Jakovljevic et al. 2014), Gaussian Mixture Model (GMM) (Jasim et al. 2017), and Hidden Markov Model (HMM) (Hannaford and Lee 1991). These methods analyze the assembly process with physical principles intrinsically and succinctly. However, they require identifying and tuning many parameters, which is time-consuming or even infeasible for complex scenarios (Li et al. 2019).

In recent years, learning-based methods have brought new opportunities to learn assembly policies directly, without contact state models (Xu et al. 2019a). Because of their simple implementation and good versatility, such methods have attracted wide attention. Among them, Learning from Demonstration (LfD) (Billard et al. 2008) and Reinforcement Learning (RL) (Sutton and Barto 2018) are two widely used approaches. Considering that human beings can easily complete various complex assembly tasks, LfD methods aim to imitate human assembly skills. Kronander (Kronander et al. 2014; Kronander 2015) modeled the joint probability distribution of the sensed wrench and the angular velocity of the end effector with a GMM; Gaussian Mixture Regression (GMR) was then used to generate the distribution of angular velocity conditioned on the sensed wrench. Building on this idea, Tang (Tang et al. 2016; Tang 2018) proved the stability of the GMR controller and developed a handheld device to collect demonstrations. Kramberger et al. (2017) proposed a Dynamic Movement Primitive (DMP) based method to generalize orientational trajectories as well as the accompanying force-torque profiles for assembly tasks. Gao et al. (2019) encoded the motion and force information as a joint probability distribution and then used a task planner and an adaptive control policy to accomplish assembly tasks. However, the performance of LfD methods depends entirely on the quality of the demonstrations: to achieve good performance, the operator must demonstrate as many states as possible, which increases the operator's workload.

Different from LfD, RL learns skills through trial and error, similar to the way humans learn. According to whether a task dynamics model is used in training, RL methods are divided into model-free and model-based methods. Model-free methods learn a policy directly from data without considering the task dynamics. Inoue et al. (2018) trained a recurrent neural network with Deep Q Network (DQN) (Silver et al. 2016) to learn a high-precision peg-in-hole assembly policy. Li et al. (2019) presented a deep RL method for a complex assembly process, in which the reward function uses a trained classification model to recognize whether the assembly is successful. However, the actions in these methods are discrete, which inevitably leads to discontinuous movements. To solve this problem, Xu et al. (2019b) proposed a model-driven Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al. 2015) algorithm for multiple peg-in-hole assembly tasks, and Ren et al. (2018) generated virtual trajectories and stiffness with a DDPG-based variable compliance controller. However, since DDPG is a deterministic policy method, noise must be added in the action or parameter space to explore more states during training, which makes training difficult to stabilize and brittle to hyper-parameter settings. To address this, Haarnoja et al. (2017) proposed a maximum entropy-based algorithm to learn stochastic policies, and later extended it to soft actor-critic (SAC) (Haarnoja et al. 2018), a state-of-the-art off-policy actor-critic deep RL algorithm that provides sample-efficient learning while retaining the benefits of entropy maximization and stability. Nevertheless, these model-free approaches need a large number of data samples, and collecting such data is time-consuming and may even lead to physical damage in real robotic assembly tasks, which seriously limits their practical application.

In contrast, model-based RL methods (Polydoros and Nalpantidis 2017) first learn the environment dynamics with deterministic models, such as Locally Weighted Linear Regression (LWLR), Receptive Field Weighted Regression (RFWR), and decision trees, or stochastic models, such as Gaussian Processes (GP) and Locally Weighted Bayesian Regression (LWBR), and then optimize trajectories with an optimal controller such as the iterative Linear Quadratic Regulator (iLQR) or iterative Linear Quadratic Gaussian (iLQG). These methods reduce the required number of samples. Thomas et al. (2018) combined motion planning with Guided Policy Search (GPS) to learn assembly skills with prior knowledge from CAD models. Luo et al. (2018) proposed a Mirror Descent Guided Policy Search (MDGPS) based method for inserting a rigid peg into a deformable hole. However, the performance of these methods depends heavily on the accuracy of the dynamics model; learning such an accurate model in assembly tasks is difficult, which in turn affects policy convergence. In addition, the learned skill is trained for a specific task and is difficult to transfer to new scenarios. In a word, RL-based methods can learn assembly skills by trial and error without contact state models or human demonstrations, but it is still challenging to learn transferable assembly skills robustly and efficiently.

In this paper, by integrating model-free and model-based learning techniques, a model-accelerated RL method is proposed for high-precision assembly tasks, which learns assembly skills robustly and efficiently while minimizing real-world interaction requirements. In the training process, the assembly policy is learned within the maximum entropy RL framework and executed by an impedance controller, which guarantees exploration efficiency and safety and allows the learned skill to be transferred to new tasks. Meanwhile, the environment dynamics are learned with a GP to improve the target value estimation and to generate virtual transitions that augment the data samples, which in turn improves training efficiency.

The rest of this paper is organized as follows: Sect. 2 introduces the peg-in-hole task and impedance control. Section 3 formulates the problem and introduces the learning algorithm. Section 4 introduces the model acceleration method and summarizes the proposed method in detail. Section 5 evaluates the proposed method with experiments on an industrial robot. Finally, Sect. 6 concludes the paper.

2 Robotic assembly task with impedance control

The peg-in-hole task can be divided into two stages, i.e. searching and insertion. In the searching stage, the robot moves to a position within the clearance region of the hole with the assistance of a vision system or a blind search method (S., R.C., M., S.B. 2001). After contact with the hole, it enters the more complicated insertion stage. In this stage, because the clearance is usually smaller than the motion accuracy of the robot, the insertion task cannot be fulfilled with position control alone. For this reason, a force sensor is often installed on the robot end-effector and the feedback force is used to realize compliant control of the manipulator.

Impedance control is a widely used compliance control method. It makes the robot end-effector and the environment maintain a contact relationship like a spring-damper system by controlling the position and velocity of the robot. In impedance control, the relationship between force and position is described by the impedance equation:

$${\mathbf{M}}_{d} {\mathbf{\ddot{e}}} + {\mathbf{K}}_{D} {\dot{\mathbf{e}}} + {\mathbf{K}}_{P} {\mathbf{e}} = {\mathbf{T}}^{T} \left( {{\mathbf{x}}_{e} } \right)\left( {{\mathbf{f}}_{e} - {\mathbf{f}}_{d} } \right)\;,$$
(1)

where \({\mathbf{M}}_{d}\), \({\mathbf{K}}_{D}\) and \({\mathbf{K}}_{P}\) are the positive-definite mass, damping and stiffness matrices respectively, \({\mathbf{e}}\) is the difference between the actual end-effector pose \({\mathbf{x}}_{e}\) and the desired pose \({\mathbf{x}}_{d}\), the orientation of the end effector is represented by the XYZ Euler angles \({\mathbf{\varphi }} = \left[ {\varphi_{x} ,\varphi_{y} ,\varphi_{z} } \right]\), \({\mathbf{f}}_{e}\) and \({\mathbf{f}}_{d}\) are the actual and desired contact forces respectively, and \({\mathbf{T}}\) is the transformation matrix between the analytic and geometric Jacobian, defined as (Siciliano et al. 2010):

$${\mathbf{T}} = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & {\sin \varphi_{y} } \\ 0 & 0 & 0 & 0 & {\cos \varphi_{x} } & { - \sin \varphi_{x} \cos \varphi_{y} } \\ 0 & 0 & 0 & 0 & {\sin \varphi_{x} } & {\cos \varphi_{x} \cos \varphi_{y} } \\ \end{array} } \right]\;.$$

After the robot end-effector contacts the environment, to control the desired contact force accurately in steady state, namely \({\mathbf{f}}_{e} = {\mathbf{f}}_{d}\), the stiffness matrix is usually set to \({\mathbf{K}}_{P} = {\mathbf{0}}\). Therefore, Eq. (1) becomes

$${\mathbf{M}}_{d} {\mathbf{\ddot{e}}} + {\mathbf{K}}_{D} {\dot{\mathbf{e}}} = {\mathbf{T}}^{T} \left( {{\mathbf{x}}_{e} } \right)\left( {{\mathbf{f}}_{e} - {\mathbf{f}}_{d} } \right)\;.$$
(2)

In assembly tasks, the environment is generally considered static and its position cannot be obtained accurately. Therefore, the desired velocity and acceleration are set to \({\mathbf{\ddot{x}}}_{d} = {\dot{\mathbf{x}}}_{d} = 0\). For convenient application in the control of industrial robots, the controller command is converted to its discrete form as

$$\begin{gathered} {\mathbf{\ddot{x}}}_{c} \left( t \right) = {\mathbf{M}}_{d}^{ - 1} \left[ {{\mathbf{T}}^{T} \left( {{\mathbf{x}}_{e} } \right)\left( {{\mathbf{f}}_{e} - {\mathbf{f}}_{d} } \right) - {\mathbf{K}}_{D} {\dot{\mathbf{x}}}_{c} \left( {t - 1} \right)} \right]\;, \hfill \\ {\dot{\mathbf{x}}}_{c} \left( t \right) = {\dot{\mathbf{x}}}_{c} \left( {t - 1} \right) + {\mathbf{\ddot{x}}}_{c} \left( t \right)T\;, \hfill \\ {\mathbf{x}}_{c} \left( t \right) = {\mathbf{x}}_{c} \left( {t - 1} \right) + {\dot{\mathbf{x}}}_{c} \left( t \right)T\;, \hfill \\ \end{gathered}$$
(3)

where T is the communication period between the impedance controller and the robot motion controller, and \({\mathbf{x}}_{c} \left( t \right)\) is the pose command in Cartesian space. After \({\mathbf{x}}_{c} \left( t \right)\) is obtained, the joint position command is calculated using inverse kinematics and sent to the robot controller to drive the robot.
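For illustration, the discrete update of Eq. (3) might be implemented as in the following NumPy sketch. The function name and argument conventions are assumptions for illustration only; the transformation matrix \({\mathbf{T}}\left( {{\mathbf{x}}_{e} } \right)\) and the gravity-compensated wrench are assumed to be supplied by the surrounding system.

```python
import numpy as np

def impedance_step(f_e, f_d, T_mat, xc_prev, xdc_prev, M_d, K_D, dt):
    """One discrete impedance-control update following Eq. (3).

    f_e, f_d : actual and desired 6D wrench (force/torque)
    T_mat    : 6x6 transformation between analytic and geometric Jacobian
    xc_prev  : previous Cartesian pose command (6,)
    xdc_prev : previous Cartesian velocity command (6,)
    M_d, K_D : desired 6x6 mass and damping matrices
    dt       : communication period between impedance and motion controller
    """
    # Commanded acceleration from the force error minus the damping term
    xddc = np.linalg.solve(M_d, T_mat.T @ (f_e - f_d) - K_D @ xdc_prev)
    # Integrate to the velocity and pose commands
    xdc = xdc_prev + xddc * dt
    xc = xc_prev + xdc * dt
    return xc, xdc
```

The returned pose command would then be converted to joint commands through inverse kinematics, as described above.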

With the impedance controller, the peg can adjust within a small range to complete the insertion; however, when the error is large, especially the angular error, the peg enters a jamming or wedging state, and the pose error cannot be eliminated by the force controller alone. To solve this problem, an RL-based method is proposed in this paper. The framework of the proposed method is shown in Fig. 1. In this method, the desired contact force \({\mathbf{f}}_{d}\) is generated by the policy learned from trial and error and used as the input of the impedance controller; the robot position is then adjusted to eliminate the pose error and complete the assembly task. In the following, the RL algorithm for the assembly policy is first introduced, and then a model acceleration method is presented to improve the training efficiency.

Fig. 1
figure 1

Framework of the RL-based assembly method

3 Reinforcement learning for robotic assembly

3.1 Problem formulation

The RL problem is usually formulated as a Markov decision process, defined as \({\mathcal{M}} = (S,A,P,R,\gamma )\), where S and A are the sets of the agent's states and actions respectively, and \(P\left( {{\mathbf{s}}_{t + 1} |{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right)\) is the probability of transitioning to state \({\mathbf{s}}_{t + 1}\) from state \({\mathbf{s}}_{t}\) by executing action \({\mathbf{a}}_{t}\), also known as the system dynamics. At each time step t, the robot chooses and executes an action \({\mathbf{a}}_{t}\) according to the assembly policy \(\pi \left( {{\mathbf{a}}_{t} |{\mathbf{s}}_{t} } \right)\) and receives a reward \(R\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} ,{\mathbf{s}}_{t + 1} } \right)\); \(\gamma \in (0,1]\) is the discount factor. The goal of RL is to find a policy \(\pi\) that maximizes the expected discounted return from the initial state:

$$J = {\mathbb{E}}_{{r_{i} ,s_{i} \sim E,a_{i} \sim \pi }} \left[ {\sum\limits_{i = 0}^{\infty } {\gamma^{i} } R\left( {{\mathbf{s}}_{i} ,{\mathbf{a}}_{i} ,{\mathbf{s}}_{i + 1} } \right)} \right]\;.$$
(4)

To learn the assembly policy with RL algorithms, the state space, action space, and reward function must be defined first.

Because the contact force characterizes the contact state, it must be included in the system state. The peg-in-hole assembly model is shown in Fig. 2. The contact force measured by the force sensor mounted on the peg is \({\mathbf{h}}^{s} = \left[ {f_{x}^{s} ,f_{y}^{s} ,f_{z}^{s} ,m_{x}^{s} ,m_{y}^{s} ,m_{z}^{s} } \right]^{T}\). This force is expressed in the force sensor frame \(x^{s} - y^{s} - z^{s}\); however, in the peg-in-hole task all movements are expressed in the hole frame \(x^{h} - y^{h} - z^{h}\). Therefore, the contact force should be transformed into the hole frame

$${\mathbf{h}} = \left[ {\begin{array}{*{20}c} {R_{s} } & 0 \\ 0 & {R_{s} } \\ \end{array} } \right]{\mathbf{h}}^{s} \;,$$
(5)

where \(R_{s}\) is the rotation matrix of the force sensor frame relative to the hole frame. It is worth noting that only the direction of the force is transformed into the hole frame; the reference point of the moment remains at the origin of the force sensor.

Fig. 2
figure 2

State in insertion task

In addition, the insertion depth \(p_{z}\) is needed to indicate the current task progress, and the axial displacement increment along the hole \(\Delta p_{z} = p^{\prime}_{z} - p_{z}\) can be used to determine whether the system is jammed. Therefore, the system state is expressed as:

$${\mathbf{s}} = \left[ {f_{x} ,f_{y} ,f_{z} ,m_{x} ,m_{y} ,p_{z} ,\Delta p_{z} } \right]\;.$$
(6)

For the cylindrical peg-in-hole insertion task, the axial moment \(m_{z}\) is usually not considered. Because the exact positions of the hole and peg are not used in the state space, the system state is not affected by changes in the hole's position.
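As a concrete illustration of Eqs. (5) and (6), the following sketch assembles the state vector from a raw sensor wrench; the helper name and argument conventions are assumptions introduced here for illustration.

```python
import numpy as np

def build_state(h_sensor, R_s, p_z, p_z_prev):
    """Assemble the 7-D state of Eq. (6) from a raw wrench measurement.

    h_sensor : wrench [fx, fy, fz, mx, my, mz] in the force-sensor frame
    R_s      : 3x3 rotation of the sensor frame relative to the hole frame
    p_z      : current insertion depth along the hole axis
    p_z_prev : insertion depth at the previous assembly step
    """
    # Block-diagonal rotation of Eq. (5): rotate force and moment directions
    rot = np.zeros((6, 6))
    rot[:3, :3] = R_s
    rot[3:, 3:] = R_s
    h = rot @ np.asarray(h_sensor)
    # The axial moment m_z is dropped for the cylindrical peg-in-hole task
    f_x, f_y, f_z, m_x, m_y, _ = h
    return np.array([f_x, f_y, f_z, m_x, m_y, p_z, p_z - p_z_prev])
```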

There are many ways to define the action space, such as joint torques or velocities. However, these definitions are not only inefficient for training but also dangerous for the robot and parts before the policy converges. Therefore, in this paper, the action is defined as the desired contact force \({\mathbf{f}}_{d}\) of the impedance control model. In the execution phase, the peg pose is controlled by the impedance controller according to the action \({\mathbf{f}}_{d}\). Compared with directly generating torques or velocities, using the desired contact force as the action and driving the robot through an impedance controller avoids irrelevant or even dangerous actions, improving training efficiency and ensuring the safety of the robot and parts before the policy converges. The action is thus defined as:

$${\mathbf{a}} = \left[ {f_{x}^{d} ,f_{y}^{d} ,f_{z}^{d} ,m_{x}^{d} ,m_{y}^{d} } \right]\;.$$
(7)

The reward function is defined as:

$$r\left( {{\mathbf{s}},{\mathbf{a}},{\mathbf{s^{\prime}}}} \right) = - \frac{1}{{N_{max} }} + \frac{{\Delta p_{z} }}{H} + r_{end} \;,$$
(8)

where \(N_{max}\) is the maximum number of steps allowed in one episode; if this number is exceeded, the episode is considered a failure and ends immediately. H is the insertion goal depth, and \(\Delta p_{z} = p^{\prime}_{z} - p_{z}\) is the change of the insertion depth after action a is executed. \(r_{end}\) is the reward of the last step of an episode, defined as

$$r_{end} = \left\{ {\begin{array}{*{20}c} 1 & {p_{z} > H} \\ { - 1} & {{\mathbf{f}}_{e} < {\mathbf{f}}_{min} \;\;{\text{or}}\;\;{\mathbf{f}}_{e} > {\mathbf{f}}_{max} } \\ 0 & {{\text{otherwise}}} \\ \end{array} } \right.\;,$$
(9)

where \({\mathbf{f}}_{min}\) and \({\mathbf{f}}_{max}\) are the minimum and maximum allowed contact forces. The reward in Eq. (8) is designed to stay within the range \(\left[ { - {1 \mathord{\left/ {\vphantom {1 {N_{max} }}} \right. \kern-\nulldelimiterspace} {N_{max} }} - 2, - {1 \mathord{\left/ {\vphantom {1 {N_{max} }}} \right. \kern-\nulldelimiterspace} {N_{max} }} + 2} \right]\). The first two terms are the immediate rewards of each step, which encourage actions that accomplish the task in the minimum number of steps and make as much insertion progress as possible; they avoid the low learning efficiency caused by sparse rewards. The last term rewards actions that successfully complete the task and punishes actions that cause the contact force to exceed the allowed range.
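A minimal sketch of the reward in Eqs. (8) and (9) is given below; representing the two terminal conditions as boolean flags is an illustrative simplification, not the authors' implementation.

```python
def compute_reward(p_z, p_z_next, success, force_violation, N_max, H):
    """Step reward of Eq. (8) with the terminal bonus/penalty of Eq. (9).

    p_z, p_z_next   : insertion depth before and after the action
    success         : True if the peg reached the goal depth H
    force_violation : True if the contact wrench left the allowed range
    N_max           : maximum number of steps per episode
    """
    r = -1.0 / N_max + (p_z_next - p_z) / H
    if success:            # terminal reward r_end = +1
        r += 1.0
    elif force_violation:  # terminal penalty r_end = -1
        r -= 1.0
    return r
```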

3.2 Learning algorithm

The proposed method is based on SAC (Haarnoja et al. 2018), a model-free actor-critic algorithm in the maximum entropy RL framework that can efficiently learn policies in high-dimensional continuous action spaces.

Traditional RL algorithms directly maximize the objective function in Eq. (4). In the maximum entropy RL framework, however, in addition to maximizing the cumulative return, the exploration ability of the policy must also be ensured. Therefore, the objective function is defined as:

$$J = \sum\limits_{t = 0}^{T} {{\mathbb{E}}_{{r_{t} ,{\mathbf{s}}_{t} \sim E,{\mathbf{a}}_{t} \sim \pi }} \left[ {r\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} ,{\mathbf{s}}_{t + 1} } \right) + \alpha {\mathcal{H}}\left( {\pi \left( { \cdot |{\mathbf{s}}_{t} } \right)} \right)} \right]} \;,$$
(10)

where \({\mathcal{H}}\left( {\pi \left( { \cdot |{\mathbf{s}}_{t} } \right)} \right)\) is the entropy of policy \(\pi\) and \(\alpha\) is the temperature parameter that determines the relative importance of the entropy term against the reward. Adding the entropy term to the objective maximizes the cumulative reward while maintaining the exploration ability of the policy.

The soft state-action value function and the soft state value function are defined as:

$$ \begin{aligned}Q\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) &= r\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} ,{\mathbf{s}}_{t + 1} } \right) + {\mathbb{E}}_{{\left( {{\mathbf{s}}_{t + 1} , \ldots } \right)\sim \rho_{\pi } }}\\ &\quad\times \left[ {\sum\limits_{l = 1}^{\infty } {\gamma^{l} } \left( {r\left( {{\mathbf{s}}_{t + l} ,{\mathbf{a}}_{t + l} ,{\mathbf{s}}_{t + l + 1} } \right) + \alpha {\mathcal{H}}\left( {\pi_{{{\text{MaxEnt}}}}^{*} \left( { \cdot |{\mathbf{s}}_{t + l} } \right)} \right)} \right)} \right]\;,\end{aligned} $$
(11)
$$V\left( {{\mathbf{s}}_{t} } \right) = \alpha \log \int_{{\mathcal{A}}} {\exp } \left( {\frac{1}{\alpha }Q\left( {{\mathbf{s}}_{t} ,{\mathbf{a^{\prime}}}} \right)} \right){\text{d}}{\mathbf{a^{\prime}}}\;.$$
(12)

The soft Bellman equation is represented as:

$$Q\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) = r\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} ,{\mathbf{s}}_{t + 1} } \right) + \gamma {\mathbb{E}}_{{{\mathbf{s}}_{t + 1} \sim p}} \left[ {V\left( {{\mathbf{s}}_{t + 1} } \right)} \right]\;,\;$$
(13)

where \(\rho_{\pi }\) in Eq. (11) denotes the distribution over trajectories \(\left( {{\mathbf{s}}_{0} ,{\mathbf{a}}_{0} ,{\mathbf{s}}_{1} ,{\mathbf{a}}_{1} , \ldots } \right)\) obtained by executing policy \(\pi\). According to these value functions, the optimal policy is

$$\pi_{{{\text{MaxEnt}}}}^{*} \left( {{\mathbf{a}}_{t} |{\mathbf{s}}_{t} } \right) = \exp \left( {\frac{1}{\alpha }\left( {Q\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) - V\left( {{\mathbf{s}}_{t} } \right)} \right)} \right)\;.$$
(14)

In SAC, the actor and critic are approximated by the networks \(\pi_{\theta } \left( {{\mathbf{a}}|{\mathbf{s}}} \right)\) and \(Q_{w} \left( {{\mathbf{s}},{\mathbf{a}}} \right)\), parameterized by \(\theta\) and \(w\) respectively. To stabilize learning, the target value is calculated with a slowly updated target Q-network \(Q_{{w^{\prime}}} \left( {{\mathbf{s}},{\mathbf{a}}} \right)\), parameterized by \(w^{\prime}\). To improve data efficiency, an experience replay buffer \({\mathcal{D}}\) stores the transition data \(\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} ,{\mathbf{s}}_{t + 1} ,r_{t} } \right)\). In each training step, a mini-batch is randomly sampled from \({\mathcal{D}}\), and the critic network is updated by minimizing the loss function:

$$J_{Q} \left( w \right) = {\mathbb{E}}_{{\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) \sim {\mathcal{D}}}} \left[ {\frac{1}{2}\left( {Q_{w} \left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) - y_{t} } \right)^{2} } \right]\;,$$
(15)

where \(y_{t}\) is the target value; more details are given in Sect. 4.

In traditional actor-critic algorithms, the policy network is updated with the gradient of \(Q_{w} \left( {{\mathbf{s}},{\mathbf{a}}} \right)\). However, since the policy defined by Eq. (14) is intractable, it is necessary to re-parameterize the policy; one option is a Gaussian policy. To limit the action to the interval \({\mathbf{a}}_{t} \in \left[ {{\mathbf{a}}_{min} ,\;{\mathbf{a}}_{max} } \right]\), the action is expressed as:

$$\begin{gathered} {\mathbf{u}}_{t} = {{\varvec{\upmu}}}_{\theta } \left( {{\mathbf{s}}_{t} } \right) + {{\varvec{\upvarepsilon}}}_{t} \circ {{\varvec{\upsigma}}}_{\theta } \left( {{\mathbf{s}}_{t} } \right)\;, \hfill \\ {\mathbf{a}}_{t} = {\mathbf{a}}_{s} \circ \tanh \left( {{\mathbf{u}}_{t} } \right) + {\mathbf{a}}_{mean} \;, \hfill \\ \end{gathered}$$
(16)

where \({{\varvec{\upmu}}}_{\theta } \left( {{\mathbf{s}}_{t} } \right)\) and \({{\varvec{\upsigma}}}_{\theta } \left( {{\mathbf{s}}_{t} } \right)\) are the two output heads of the policy network, representing the mean and standard deviation respectively; \({{\varvec{\upvarepsilon}}}_{t}\) is a noise vector sampled from the standard normal distribution, \(\circ\) is the Hadamard product, and \({\mathbf{a}}_{s}\) and \({\mathbf{a}}_{mean}\) are the scale and mean of the action:

$$\begin{gathered} {\mathbf{a}}_{s} = \frac{{{\mathbf{a}}_{max} - {\mathbf{a}}_{min} }}{2}\;, \hfill \\ \;{\mathbf{a}}_{mean} = \frac{{{\mathbf{a}}_{max} + {\mathbf{a}}_{min} }}{2}\;, \hfill \\ \end{gathered}$$
(17)

where \({\mathbf{a}}_{max}\) and \({\mathbf{a}}_{min}\) are the upper and lower limits of actions. In this parameterization, the log-likelihood of action \({\mathbf{a}}_{t}\) is:

$$\begin{aligned} \log \pi \left( {{\mathbf{a}}_{t} |{\mathbf{s}}_{t} } \right) &= \log {\mathcal{N}}\left( {{\mathbf{u}}_{t} |{{\varvec{\upmu}}}_{\theta } \left( {{\mathbf{s}}_{t} } \right),{{\varvec{\upsigma}}}_{\theta } \left( {{\mathbf{s}}_{t} } \right)} \right) \\ &\quad- \sum\limits_{i = 1}^{D} {\log \left( {a_{s,i} \left( {1 - \tanh^{2} \left( {u_{t,i} } \right)} \right)} \right)} \;, \end{aligned}$$
(18)

where \(a_{s,i}\) and \(u_{t,i}\) are the i-th elements of \({\mathbf{a}}_{s}\) and \({\mathbf{u}}_{t}\) respectively, and D is the action dimension.
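The sampling scheme of Eqs. (16)-(18) can be sketched with PyTorch's reparameterized Normal distribution as follows; the function signature and the small epsilon added for numerical stability are assumptions, not part of the original implementation.

```python
import torch

def sample_action(mean, log_std, a_min, a_max):
    """Sample a bounded action via the squashed Gaussian of Eqs. (16)-(18).

    mean, log_std : outputs of the policy network for one state, shape [act_dim]
    a_min, a_max  : action bounds, tensors of shape [act_dim]
    """
    a_scale = (a_max - a_min) / 2.0
    a_mean = (a_max + a_min) / 2.0

    std = log_std.exp()
    dist = torch.distributions.Normal(mean, std)
    u = dist.rsample()                     # reparameterized sample: mu + eps * sigma
    a = a_scale * torch.tanh(u) + a_mean   # squash and rescale into [a_min, a_max]

    # Log-likelihood with the change-of-variables correction of Eq. (18)
    log_prob = dist.log_prob(u).sum(-1)
    log_prob -= torch.log(a_scale * (1.0 - torch.tanh(u) ** 2) + 1e-6).sum(-1)
    return a, log_prob
```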

In this way, the random action \({\mathbf{a}}_{t}\) enhances the exploration ability while remaining within safe bounds. During training, the policy network parameters are updated to minimize the Kullback-Leibler (KL) divergence:

$$J_{\pi } \left( \theta \right) = D_{KL} \left( {\pi \left( { \cdot |{\mathbf{s}}_{t} } \right)\left\| {\exp \left( {\frac{1}{\alpha }\left( {Q\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) - V\left( {{\mathbf{s}}_{t} } \right)} \right)} \right)} \right.} \right)\;.$$
(19)

With Eqs. (12) and (14), the objective can be rewritten as:

$$J_{\pi } \left( \theta \right) = {\mathbb{E}}_{{{\mathbf{s}}_{t} \sim {\mathcal{D}},{\mathbf{a}}_{t} \sim \pi_{\theta } }} \left[ {\alpha \log \pi_{\theta } \left( {{\mathbf{a}}_{t} |{\mathbf{s}}_{t} } \right) - Q_{w} \left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right)} \right]\;.$$
(20)

The parameter \(w^{\prime}\) of the target network \(Q_{{w^{\prime}}} \left( {{\mathbf{s}},{\mathbf{a}}} \right)\) is softly updated by:

$$w^{\prime} = \tau w + \left( {1 - \tau } \right)w^{\prime}\;,$$
(21)

where \(0 < \tau \ll 1\) is the update rate.

In addition, the temperature coefficient \(\alpha\) determines the weight of the entropy term in objective (10) and therefore affects the learned assembly policy. During training, it is updated with the following objective

$$J\left( \alpha \right) = {\mathbb{E}}_{{{\mathbf{a}}_{t} \sim \pi }} \left[ { - \alpha \log \pi_{\theta } \left( {{\mathbf{a}}_{t} |{\mathbf{s}}_{t} } \right) - \alpha {\tilde{\mathcal{H}}}} \right]\;,$$
(22)

where \({\tilde{\mathcal{H}}}\) is the entropy target.
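To make the update equations concrete, the following hedged sketch performs one SAC gradient step covering Eqs. (15), (20), (21) and (22). The network objects, optimizers, batch layout and hyper-parameter values are assumptions, and the one-step soft target used here for brevity is replaced by the model-based λ-return in Sect. 4.

```python
import torch

def sac_update(batch, policy, q1, q2, q1_targ, q2_targ, log_alpha,
               q_optim, pi_optim, alpha_optim,
               gamma=0.99, tau=0.005, target_entropy=-5.0):
    """One SAC gradient step on a mini-batch (cf. Eqs. 15, 20, 21, 22).

    batch is a dict of tensors: s, a, r, s_next, done (done in {0, 1}).
    policy(s) is assumed to return (action, log_prob), e.g. via the
    squashed-Gaussian sampling sketched above.
    """
    s, a, r = batch["s"], batch["a"], batch["r"]
    s2, d = batch["s_next"], batch["done"]
    alpha = log_alpha.exp()

    # Critic update, Eq. (15), with a one-step soft target for brevity
    with torch.no_grad():
        a2, logp2 = policy(s2)
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1.0 - d) * (q_min - alpha * logp2)
    q_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    q_optim.zero_grad()
    q_loss.backward()
    q_optim.step()

    # Policy update, Eq. (20): minimize alpha * log pi - min(Q1, Q2)
    a_new, logp = policy(s)
    pi_loss = (alpha.detach() * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    pi_optim.zero_grad()
    pi_loss.backward()
    pi_optim.step()

    # Temperature update, Eq. (22), driving the policy entropy toward its target
    alpha_loss = (-log_alpha * (logp.detach() + target_entropy)).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()

    # Soft update of the target critics, Eq. (21)
    with torch.no_grad():
        for net, net_targ in ((q1, q1_targ), (q2, q2_targ)):
            for p, p_t in zip(net.parameters(), net_targ.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
```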

4 Model acceleration

In conventional methods, the target value in Eq. (15) is calculated with TD(0), which is based on the one-step reward and bootstrapping from the estimated value of the next state given by the target network \(Q_{{w^{\prime}}} \left( {{\mathbf{s}},{\mathbf{a}}} \right)\). However, the inaccuracy of the target network leads to an inaccurate target value, which in turn reduces training efficiency. Besides, SAC is a model-free algorithm: the parameters are updated only with real transition data \(\left( {{\mathbf{s}}_{i} ,{\mathbf{a}}_{i} ,{\mathbf{s}}_{i + 1} ,r_{i} } \right)\) without considering the system dynamics \(P\left( {{\mathbf{s}}_{t + 1} |{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right)\), which leads to a large demand for transition data. In robotic applications, collecting such an amount of data is time-consuming and tedious and may even lead to physical damage, so improving data efficiency is of great significance. To solve these problems, this paper learns a model of the environment dynamics and then uses it to improve the estimation of future rewards and to augment the transition data.

4.1 Gaussian process for dynamics modeling

Existing research has proposed a variety of model classes for dynamics modeling. Among them, the GP is a state-of-the-art approach: a GP is a distribution over functions with a continuous domain and makes no assumption about the function that maps current states and actions to future states (Polydoros and Nalpantidis 2017). This makes the GP a powerful tool for dynamics modeling.

A GP is specified by a mean function \(m\left( \cdot \right)\) and a covariance/kernel function \(k\left( \cdot \right)\):

$$f\left( {\mathbf{x}} \right) \sim {\mathcal{G}\mathcal{P}}\left( {m\left( {\mathbf{x}} \right),k\left( {{\mathbf{x}},{\mathbf{x^{\prime}}}} \right)} \right)\;.$$
(23)

To predict the state \({\mathbf{s}}_{t + 1}\) after action \({\mathbf{a}}_{t}\) is performed in state \({\mathbf{s}}_{t}\), \({\tilde{\mathbf{x}}} = \left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right)\) is used as the training input and the difference \({{\varvec{\Delta}}}_{t} = {\mathbf{s}}_{t + 1} - {\mathbf{s}}_{t}\) as the training target. In this way, the prior mean function can be set to \(m\left( \cdot \right) \equiv 0\), which simplifies the calculation. The kernel function is defined as:

$$k\left( {{\tilde{\mathbf{x}}},{\mathbf{\tilde{x}^{\prime}}}} \right) = \sigma_{f}^{2} \exp \left( { - \frac{1}{2}\left( {{\tilde{\mathbf{x}}} - {\mathbf{\tilde{x}^{\prime}}}} \right)^{T} {{\varvec{\Lambda}}}^{ - 1} \left( {{\tilde{\mathbf{x}}} - {\mathbf{\tilde{x}^{\prime}}}} \right)} \right),$$
(24)

where \({{\varvec{\Lambda}}}{ = }diag\left( {\left[ {l_{1}^{2} , \ldots ,l_{D}^{2} } \right]} \right)\), \(l_{i}\) are the characteristic length-scales, and \(\sigma_{f}\) is the signal variance. Given a set of training inputs \({\tilde{\mathbf{X}}} = \left[ {{\tilde{\mathbf{x}}}_{1} ,{\tilde{\mathbf{x}}}_{2} , \ldots ,{\tilde{\mathbf{x}}}_{n} } \right]\) and the corresponding targets \({\mathbf{y}} = \left[ {{{\varvec{\Delta}}}_{1} ,{{\varvec{\Delta}}}_{2} , \ldots ,{{\varvec{\Delta}}}_{n} } \right]\), the hyper-parameters can be learned by maximizing the marginal likelihood (Williams and Rasmussen 2006).

The environment dynamics model can be used to predict the result of action \({\mathbf{a}}_{t}\) performed in state \({\mathbf{s}}_{t}\). In the GP model, the posterior distribution of the successor state is Gaussian:

$$p\left( {{\mathbf{s}}_{t + 1} |{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) = {\mathcal{N}}\left( {{\mathbf{s}}_{t + 1} |{{\varvec{\upmu}}}_{t + 1} ,{{\varvec{\Sigma}}}_{t + 1} } \right)\;,$$
(25)

where

$$\begin{gathered} {{\varvec{\upmu}}}_{t + 1} = {\mathbf{s}}_{t} + {{\mathbb{E}}}_{f} \left[ {{{\varvec{\Delta}}}_{t} } \right]\;, \hfill \\ {{\varvec{\Sigma}}}_{t + 1} = {\text{var}}_{f} \left[ {{{\varvec{\Delta}}}_{t} } \right]\;. \hfill \\ \end{gathered}$$
(26)

The mean and variance of the GP predictor are

$$\begin{gathered} {{\mathbb{E}}}_{f} \left[ {{{\varvec{\Delta}}}_{t} } \right] = {\mathbf{k}}_{*}^{T} \left( {{\mathbf{K}} + \sigma_{\varepsilon }^{2} {\mathbf{I}}} \right)^{ - 1} {\mathbf{y}}\;, \hfill \\ {\text{var}}_{f} \left[ {{{\varvec{\Delta}}}_{t} } \right] = {\mathbf{k}}_{**} - {\mathbf{k}}_{*}^{T} \left( {{\mathbf{K}} + \sigma_{\varepsilon }^{2} {\mathbf{I}}} \right)^{ - 1} {\mathbf{k}}_{*}^{{}} \;, \hfill \\ \end{gathered}$$
(27)

where \({\mathbf{k}}_{*} = k\left( {{\tilde{\mathbf{X}}},{\tilde{\mathbf{x}}}_{t} } \right)\), \({\mathbf{k}}_{**} = k\left( {{\tilde{\mathbf{x}}}_{t} ,{\tilde{\mathbf{x}}}_{t} } \right)\) and \({\mathbf{K}}\) is the kernel matrix with elements \(K_{ij} = k\left( {{\tilde{\mathbf{x}}}_{i} ,{\tilde{\mathbf{x}}}_{j} } \right)\), \(\sigma_{\varepsilon }^{{}}\) is the noise variance.
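A plain-NumPy sketch of the GP posterior of Eqs. (24)-(27) for a single output dimension is given below, assuming the hyper-parameters have already been fitted; in practice a GP library would typically be used instead.

```python
import numpy as np

def gp_predict(X_train, y_train, x_query, lengthscales, sigma_f, sigma_eps):
    """GP posterior mean and variance of one output dimension of
    delta_t = s_{t+1} - s_t, following Eqs. (24)-(27).

    X_train : (n, d) training inputs, each row a concatenated (s_t, a_t)
    y_train : (n,)  training targets for this output dimension
    x_query : (d,)  query input (s_t, a_t)
    """
    Lam_inv = np.diag(1.0 / np.asarray(lengthscales) ** 2)

    def k(a, b):
        # Squared-exponential kernel of Eq. (24)
        diff = a - b
        return sigma_f ** 2 * np.exp(-0.5 * diff @ Lam_inv @ diff)

    n = X_train.shape[0]
    K = np.array([[k(X_train[i], X_train[j]) for j in range(n)] for i in range(n)])
    k_star = np.array([k(X_train[i], x_query) for i in range(n)])
    k_star_star = k(x_query, x_query)

    A = K + sigma_eps ** 2 * np.eye(n)                        # K + sigma_eps^2 I
    coef = np.linalg.solve(A, y_train)
    mean = k_star @ coef                                      # E_f[delta_t]
    var = k_star_star - k_star @ np.linalg.solve(A, k_star)   # var_f[delta_t]
    return mean, var
```

Following Eq. (26), the predicted next state is the current state plus the returned mean, with one such GP maintained per state dimension.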

4.2 Model acceleration with λ-return

To improve the target value estimation, the short-term return is estimated by unrolling the learned dynamics model, and the long-term value is estimated with the target critic network. According to Eq. (11), the n-step TD target of state \({\mathbf{s}}_{t}\) under policy \(\pi_{\theta }\) is

$$\begin{aligned} y_{t}^{n} &= \sum\limits_{i = 0}^{n} {\gamma^{i} \left[ {r\left( {{\mathbf{s}}_{t + i} ,{\mathbf{a}}_{t + i} ,{\mathbf{s}}_{t + i + 1} } \right) - \alpha \gamma \log \pi \left( {{\mathbf{a}}_{t + i + 1} |{\mathbf{s}}_{t + i + 1} } \right)} \right]} \\ &\quad+ \gamma^{n + 1} Q_{{w^{\prime}}} \left( {{\mathbf{s}}_{t + n + 1} ,{\mathbf{a}}_{t + n + 1} } \right)\;, \end{aligned}$$
(28)

where the successor states are sampled from the learned model \(P\left( {{\mathbf{s}}_{t + 1} |{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} } \right) \sim {\mathcal{G}\mathcal{P}}\). Since each action \({\mathbf{a}}_{t}\) is sampled from the current policy \(\pi_{\theta } \left( {{\mathbf{s}}_{t} } \right)\), the procedure is on-policy and importance weights are not needed.

Since the rollout is simulated by the dynamics model rather than obtained by interacting with the environment, the distribution mismatch of the model directly affects the accuracy of the value estimate, and the estimation error accumulates as the number of simulated steps increases. To alleviate this problem, the \(\lambda\)-return is used to average the n-step targets along the rollout:

$$y_{t}^{\lambda } = \left( {1 - \lambda } \right)\sum\limits_{i = 1}^{n} {\lambda^{i - 1} y_{t}^{i - 1} } + \lambda^{n} y_{t}^{n} \;,$$
(29)

In this way, the estimated target value can be improved, which in turn improves the training efficiency.
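The λ-return target of Eqs. (28) and (29) might be computed as in the following sketch, where the rollout quantities (rewards, log-probabilities, and target Q-values) are assumed to have been produced by simulating the learned GP model with the current policy; the function and argument names are illustrative.

```python
def lambda_target(rewards, log_pis, q_targets, alpha, gamma, lam):
    """Model-based lambda-return target of Eqs. (28)-(29).

    The arrays describe a short rollout simulated with the learned GP model,
    starting from a real state s_t:
      rewards[i]   = r(s_{t+i}, a_{t+i}, s_{t+i+1})      for i = 0..n
      log_pis[i]   = log pi(a_{t+i+1} | s_{t+i+1})        for i = 0..n
      q_targets[i] = Q_target(s_{t+i+1}, a_{t+i+1})       for i = 0..n
    """
    n = len(rewards) - 1

    def n_step(m):
        # m-step soft TD target y_t^m of Eq. (28)
        y = 0.0
        for i in range(m + 1):
            y += gamma ** i * (rewards[i] - alpha * gamma * log_pis[i])
        return y + gamma ** (m + 1) * q_targets[m]

    # Weighted average of the n-step targets, Eq. (29)
    y_lam = (1.0 - lam) * sum(lam ** (i - 1) * n_step(i - 1) for i in range(1, n + 1))
    return y_lam + lam ** n * n_step(n)
```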

4.3 Model acceleration with imagination

Another obstacle to applying deep RL methods in real assembly tasks is the large demand for transition data. However, once the dynamics model is learned, virtual transition data can be generated by simulating the model, improving data efficiency.

In conventional methods, a transition \(\left( {{\mathbf{s}}_{t} ,{\mathbf{a}}_{t} ,{\mathbf{s}}_{t + 1} ,r_{t} } \right)\) sampled from the replay buffer \({\mathcal{D}}\) only reflects the action \({\mathbf{a}}_{t}\) selected and executed in the past. To explore more states, the proposed method selects a new action \({\mathbf{a^{\prime}}}_{t}\) with the current policy, predicts the successor state and reward with the learned dynamics model and the reward function, and augments the transition data with these virtual transitions.


Moreover, because these virtual data are generated by the current policy, the corresponding training is on-policy. In addition, since all virtual data are generated online, no extra replay buffer is needed to store them. The model acceleration method is illustrated in Fig. 3.
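A possible sketch of this data augmentation step is shown below; the policy, GP-model, and reward-function interfaces are hypothetical names used only for illustration.

```python
def imagine_transitions(real_states, policy, gp_model, reward_fn):
    """Generate virtual transitions from real states with the current policy.

    real_states : list of real states s_t sampled from the replay buffer
    policy      : current stochastic policy, returns (action, log_prob)
    gp_model    : learned GP dynamics, predicts the mean next state for (s, a)
    reward_fn   : known reward function of Eq. (8)
    """
    virtual = []
    for s in real_states:
        a_new, _ = policy(s)                       # re-select the action on-policy
        s_next = gp_model.predict_mean(s, a_new)   # imagined successor state
        r = reward_fn(s, a_new, s_next)
        virtual.append((s, a_new, s_next, r))
    # Virtual data are consumed immediately for a gradient step (no extra buffer)
    return virtual
```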

Fig. 3
figure 3

λ-return and virtual transition with the dynamic model

4.4 Algorithm summary

The architecture of the proposed model-accelerated RL algorithm for robotic assembly tasks is shown in Fig. 4. To further reduce the critic estimation error, two critic networks are used in SAC, similar to TD3 (Fujimoto et al. 2018): the critics and actor are updated with the minimum of the target and critic values of the actions selected by the policy. In the training process, the dynamics model is updated after each assembly trial, and the network parameters are then trained T times with real data sampled from the replay buffer and \(T \times N\) times with virtual data generated by the dynamics model. The exact procedure is listed in Algorithm 1.
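Under these assumptions, one outer iteration of the training procedure (cf. Algorithm 1) might be organized as in the sketch below; the environment and replay-buffer interfaces and the update-function signature are illustrative and not taken from the original implementation.

```python
def train_episode(env, replay_buffer, policy, gp_model, update_fn,
                  T_updates, N_virtual):
    """One assembly trial followed by the update schedule of Sect. 4.4 (sketch).

    update_fn performs one gradient step, either on a real mini-batch or on
    virtual transitions imagined with the GP model.
    """
    # Interact with the real robot for one assembly trial
    s, done = env.reset(), False
    while not done:
        a, _ = policy(s)
        s_next, r, done = env.step(a)
        replay_buffer.add(s, a, s_next, r, done)
        s = s_next

    # Refit the GP dynamics model on recent real transitions
    gp_model.fit(replay_buffer.recent_transitions())

    # Train T times on real data and T x N times on model-generated data
    for _ in range(T_updates):
        update_fn(replay_buffer.sample(), use_model=False)
        for _ in range(N_virtual):
            update_fn(replay_buffer.sample(), use_model=True)
```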

Fig. 4
figure 4

The architecture of the proposed method

5 Experiments

5.1 Experimental setup

To validate the feasibility of the proposed learning-based assembly method, experiments are conducted on an industrial robot. The layout of the experimental system is illustrated in Fig. 5. The robot is a UR5 from Universal Robots, and a force sensor is mounted on the end effector to measure the contact force between the peg and hole. The assembly controller and impedance controller are implemented on a PC with an Intel Core i7-5500U CPU @ 2.40 GHz, 16 GB RAM, and the Ubuntu 14.04 operating system. The PC collects force/torque data from the force sensor through an MCC 1608-FS-Plus data acquisition device and communicates with the robot controller over TCP/IP at a rate of 125 Hz. After acquiring the F/T measurement and robot position, the actual contact force is calculated by a real-time gravity compensation module. All the software is implemented within the Robot Operating System (ROS) framework; the assembly controller and impedance controller run in different nodes, at 2 Hz and 125 Hz respectively. The peg and hole used in the experiment are made of steel, with diameters of 15.00 mm and 15.02 mm respectively, and the depth of the hole is 35 mm. Different from the two-dimensional assembly experiment in Ref. (Ren et al. 2018), a more complex three-dimensional assembly task is adopted in this experiment.

Fig. 5
figure 5

Experimental layout

During training, the robot starts from a random pose with up to \(\pm 0.1\) mm lateral error and \(\pm 4^{ \circ }\) angular error. The mass matrix \({\mathbf{M}}_{d}\) and damping matrix \({\mathbf{K}}_{D}\) in the impedance controller are set to diag([0.67, 0.67, 0.67, 0.017, 0.017, 0.017]) and diag([266.7, 266.7, 533.3, 3.3, 3.3, 3.3]) respectively. The lower and upper limits of the action in Eq. (17) are set to [− 5 N, − 5 N, 20 N, − 0.5 Nm, − 0.5 Nm] and [5 N, 5 N, 30 N, 0.5 Nm, 0.5 Nm] respectively. The maximum number of steps \(N_{max}\) is 60. For the safety of the robot and force sensor, the maximum allowed force and moment are 70 N and 4 Nm respectively. In one episode, the robot executes assembly actions step by step until the task either succeeds by inserting the peg to the goal depth or fails by exceeding the allowed contact force/torque bounds or the maximum number of steps. The dynamics model is fitted with 400 real transition samples. The other hyper-parameters are listed in Table 1.

Table 1 Hyper-parameters for training

5.2 Results

To evaluate the training performance of the proposed method, the insertion task is trained for 135 episodes with two methods, i.e. with and without model acceleration; the training rewards are shown in Fig. 6. It can be found that the performance of both methods improves as the number of episodes increases. After the policies converge, the reward and success rate of the two methods are basically the same, which means the two learned policies perform equally well. In Fig. 6a, the policy achieves a high success rate after 95 episodes, whereas, as shown in Fig. 6b, with the benefit of the dynamics model the proposed method only needs 65 episodes, an improvement in training efficiency of about 31%. The average Q-value losses are shown in Fig. 7: with model acceleration, the loss drops to 0.01 within 600 training steps, whereas without model acceleration it takes 2000 training steps. The reason is that, on the one hand, the target value is improved by the \(\lambda\)-return computed with the dynamics model, and on the other hand, the current policy is used to select actions that are simulated with the dynamics model, exploring more states and actions. After 3000 training steps, due to the error of the learned dynamics model, the loss values of the proposed method are slightly larger; however, since the loss is very small, this has little influence on the policy. It is worth noting that although model acceleration increases the amount of computation, the learning is performed while the robot re-initializes between two assembly executions, so it does not increase the total training time.

Fig. 6
figure 6

Training reward curve. a Episode reward without model acceleration; b Episode reward with model acceleration

Fig. 7
figure 7

Critic losses in training. a Q1; b Q2

In the training process, a GP is used to learn the dynamics, and Fig. 8 shows the prediction results of the learned model for one episode. Because the actual contact point changes as the peg's pose is adjusted during insertion, and the desired contact force may not be achieved within one control cycle of the assembly controller, the exact value of the next state cannot be predicted accurately, but the results reflect the trend of the states. For this reason, the \(\lambda\)-return is used in the proposed method to reduce the influence of this distribution mismatch.

Fig. 8
figure 8

Prediction result with the dynamic model for one episode

After training, the learned assembly policy is evaluated. An insertion process with a 5° angular error is shown in Fig. 9, and the corresponding trajectory is shown in Fig. 10. The robot starts from a random pose, as shown in Fig. 9a, and enters the insertion stage when it reaches the top of the hole. During the first 4 steps, the peg moves down quickly until it contacts the hole at the 5th step, as shown in Fig. 9b. Due to the large angular error, the peg gets stuck and falls into a jamming state. In steps 5-45, the peg adjusts its pose by executing the actions selected by the assembly policy through the impedance controller. The angular error is eliminated at step 45, as shown in Fig. 9c. In the last stage, the peg dashes down to the goal depth and finishes the insertion task.

Fig. 9
figure 9

Insert process with 5° angular error. a Starting from a random pose; b, c align the peg with the hole; d, e dashing

Fig. 10
figure 10

Trajectory by the learned policy. a Force trajectory; b Moment trajectory; c z-axis position; d z-axis position increment

The results of the first experiment show that the proposed method and SAC have the same success rate and episode reward after the policies converge. To further demonstrate the effectiveness of the proposed method, its success rate is compared with that of a pure impedance control method under different angular errors. In the pure impedance control method, the assembly process is controlled only by the impedance controller described in Eq. (3). For each angular error, ten insertions are conducted, and the results are listed in Table 2. It can be found that pure impedance control can only handle angular errors of at most 4°, because jamming occurs when the angular error is larger and the error cannot then be eliminated by impedance control alone. In contrast, the proposed method succeeds reliably up to 5° and can still succeed when the angular error is 8°. Meanwhile, because of the continuous actions and stochastic policy adopted here, the performance of the proposed method is better than that of the DQN-based method in Ref. (Inoue et al. 2018), where the maximum allowable initial angular error is 1.6°.

Table 2 Success rate with different angular errors

Although the proposed method can speed up the training process, learning such skills from scratch still requires many trials. A standard way to accelerate training is to initialize the networks with those trained for another task. Next, the transferability of the learned skill for accelerating the learning process is validated. In this experiment, a new insertion task is adopted, in which the diameters of the peg and hole are 29.98 mm and 30.01 mm respectively and both parts are made of aluminum alloy. The network weights are initialized with those learned in the previous experiment. The initial lateral and angular errors are \(\pm 0.1\) mm and \(\pm 4^{ \circ }\) respectively; the training reward is shown in Fig. 11 and an insertion process is shown in Fig. 12. In the first 30 episodes, the learned policy can already complete the new task with an 86.7% success rate, and the performance gradually improves as learning proceeds. Compared with training from scratch in Fig. 6, the training efficiency is significantly improved. This is because, on the one hand, the Gaussian policy used in this paper has a strong exploration ability; on the other hand, an impedance controller is used for low-level control, so the assembly policy is less affected by the size and material of the parts, and the learned policy serves as a good initialization for the new task.

Fig. 11
figure 11

Training reward curve of transferring the learned policy to a new task

Fig. 12
figure 12

Insert process for the new parts. a Starting from a random pose; b, c align the peg with the hole; d, e dashing

6 Conclusion

In this paper, an RL method integrating model-free and model-based learning techniques is proposed to learn high-precision robotic assembly skills. Compared with existing methods, it has the following advantages: (1) the assembly policy is learned within the maximum entropy RL framework, so the training process is efficient and robust, and the learned skill can be transferred to new tasks; (2) the robot is driven by an impedance controller to ensure exploration efficiency and safety during training; (3) to improve training efficiency and reduce sample complexity, the environment dynamics are learned with a Gaussian Process to improve the target value estimation and to augment the transition samples. Experiments are conducted on an industrial robot. The results demonstrate that the proposed method reduces sample complexity and improves training efficiency compared with the model-free method, and that the learned policy can serve as a good initialization for a new task.

When learning skills, people usually do not start from scratch: they first acquire initial skills through imitation and then refine them through trial and error. Therefore, one direction of our future work is to combine imitation learning and reinforcement learning to further improve learning efficiency. In addition, this paper only adopted the classical cylindrical peg-in-hole task to validate the proposed method; future work will apply the proposed method to assembly tasks with irregularly shaped or deformable parts.