
1 Introduction

Reinforcement Learning (RL) is one of the three main paradigms of Machine Learning. An RL system is composed of an agent, an environment and its model, a policy, a reward signal and a value function [1]. The main idea is that an agent is trained through direct interaction with its environment, learning to take actions that maximize a numerical reward signal [1, 2].

Registration is the process of geometrically aligning two or more images, taken at different times, from different viewpoints or with different modalities, so that their information can be compared and combined [3]. In the medical field, this process plays an important role, especially in patient monitoring and treatment [4]. It requires robust algorithms that achieve accurate results with reasonable time and computing resources.

The aim of this work is to apply RL algorithms, namely Q-learning, Deep Q-Network (DQN), Double DQN and Dueling DQN, to image registration tasks in order to compare their performances. Moreover, a backup memory criterion is introduced as a novel way to train a neural network and improve the results of DQN-type algorithms. It is applied to Double DQN and its performance is analyzed.

This work is organized as follows: Sect. 2 summarizes the algorithms used; Sect. 3 describes their application in an image registration process; Sect. 4 shows the obtained results and Sect. 5 presents the conclusions and future work.

2 Reinforcement Learning Algorithms

2.1 General Description

Reinforcement learning is based on the theory of dynamic systems for optimal control. Markov decision processes (MDPs) are used because of their mathematical background. MDPs are the formalization of sequential decision making in which actions influence not only immediate rewards but also future states and rewards [1].

A finite Markov decision process is defined by a tuple \( \left\langle {S, A, T, R, E} \right\rangle \), where \( S \) is the agent's state space, \( A \) is the action space, \( T \) is the transition dynamics, with \( T\left( {s, a, s^{\prime}} \right) \) denoting the probability of reaching the next state \( s^{\prime} \) after selecting action \( a \) in the current state \( s \), \( R \) is the numerical reward obtained when the agent moves from state \( s \) to \( s^{\prime} \) by selecting an action, and \( E \subset S \) is the set of terminal states, from which no further transitions occur [5]. Thus, the probability of these values occurring at time \( t \), given particular values of the state and action, is defined as:

$$ p\left( {s^{\prime}, r | s, a} \right) = \Pr \left\{ {S_{t} = s^{\prime}, R_{t} = r | S_{t - 1} = s, A_{t - 1} = a} \right\} . $$
(1)

2.2 Q-Learning Algorithm

Q-learning is an RL technique derived from Temporal Difference (TD) methods. TD methods combine properties of dynamic programming algorithms and Monte Carlo methods. The goal is to learn a policy, which tells the agent what action to take in each state [6].

In order to adjust the policies, Watkins [7] introduced an action-value function \( Q^{\pi } \left( {s, a} \right) \), which is the expected return obtained when starting in state \( s \), selecting action \( a \) and following policy \( \pi \) thereafter. According to the Bellman equation, the \( Q \) function [8] for an optimal policy is defined as:

$$ Q^{ *} \left( {s,a} \right) = \sum\nolimits_{{s^{\prime} \in S}} {T\left( {s,a,s^{\prime}} \right)\left[ {R\left( {s,a,s^{\prime}} \right) + \gamma \mathop {\max }\nolimits_{a^{\prime}} Q^{ *} \left( {s^{\prime},a^{\prime}} \right)} \right]} , $$
(2)

where \( 0 \le \gamma \le 1 \) is the discount factor, which weighs the immediate reward against the expected future rewards.

The function \( Q \) is updated using only the information that is instantaneously available, as a function of the estimated optimal actions (normally a greedy policy is used) [6]. The update of \( Q \) at time \( t \) can be written as:

$$ \begin{aligned} \delta_{t} & = r_{t + 1} + \gamma \mathop {\max }\nolimits_{a^{\prime}} Q_{t} \left( {s_{t + 1} ,a^{\prime}} \right) - Q_{t} \left( {s_{t} ,a_{t} } \right) \\ Q_{t + 1} \left( {s_{t} ,a_{t} } \right) & \leftarrow Q_{t} \left( {s_{t} ,a_{t} } \right) + \alpha_{t} \left( {s_{t} ,a_{t} } \right)\delta_{t} , \end{aligned} $$
(3)

where \( \alpha \) is the learning rate, \( \alpha \in \left( {0,1} \right] \).
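As an illustration of Eq. (3), the following is a minimal sketch of one tabular Q-learning step in Python (the language used in this work, see Sect. 3.3). The state indexing, the environment interface env.step and the parameter values are hypothetical placeholders, not the implementation used here.

```python
import numpy as np

# Hypothetical sizes: in this work the state encodes (scale, angle, tx, ty)
# and there are 16 actions (see Sect. 3.4).
N_STATES, N_ACTIONS = 1000, 16
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))   # tabular action-value function

def q_learning_step(env, s, rng=np.random.default_rng()):
    """One update of Eq. (3): delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    # epsilon-greedy action selection (greedy policy with exploration)
    if rng.random() < EPSILON:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(np.argmax(Q[s]))
    s_next, r, done = env.step(a)           # assumed environment interface
    target = r if done else r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])   # TD update of Eq. (3)
    return s_next, done
```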

2.3 Deep-Q Network Algorithm

The DQN algorithm, proposed by Mnih et al. [9], uses a multilayer neural network that takes the state \( s \in S \) as input and outputs one value for each action in \( A \). The action-value function is \( Q\left( {s, .; \theta } \right) \), where \( \theta \) are the weights of the deep network. The function \( Q \) is obtained by updating \( \theta \) based on experience [10, 11].

In this case, the agent learns the parameterized function \( Q\left( {s, a; \theta_{t} } \right) \), whose weights are updated as:

$$ \theta_{t + 1} = \theta_{t} + \alpha \left( {Y_{t}^{Q} - Q\left( {s_{t} ,a_{t} ;\theta_{t} } \right)} \right)\nabla_{{\theta_{t} }} Q\left( {s_{t} ,a_{t} ;\theta_{t} } \right) , $$
(4)

where \( \alpha \) is the learning rate and \( Y_{t}^{Q} \) is the target value, computed as \( Y_{t}^{Q} = r_{t + 1} + \gamma \mathop {\hbox{max} }\nolimits_{a} Q\left( {s_{t + 1} , a;\theta_{t} } \right) \).
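The target \( Y_{t}^{Q} \) can be computed for a minibatch of transitions as in the sketch below. The q_network(states, theta) interface, returning one Q-value per action, is an assumption for illustration, not the network used in this work.

```python
import numpy as np

GAMMA = 0.9   # assumed discount factor

def dqn_targets(q_network, theta, rewards, next_states, dones):
    """Y_t^Q = r_{t+1} + gamma * max_a Q(s_{t+1}, a; theta)."""
    q_next = q_network(next_states, theta)   # assumed shape: (batch, n_actions)
    max_q_next = q_next.max(axis=1)          # max over the action dimension
    # terminal transitions (dones == 1) contribute only the reward
    return rewards + GAMMA * (1.0 - dones) * max_q_next
```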

2.4 Double DQN Algorithm

If the same action-value function is used both to select and to evaluate an action \( a \), the action values tend to be overestimated. In the Double DQN algorithm, the agent learns two action-value functions by randomly assigning experiences to one of them, obtaining two sets of parameters (weights) \( \theta \) and \( \theta^{\prime} \) [10]. In each update, one set of parameters is used to determine the greedy policy and the other one is used to compute its value. The target values in the Double DQN algorithm are computed as:

$$ Y_{t}^{DoubleQ} = r_{t + 1} + \gamma \,Q\left( {s_{t + 1} , \mathop {\arg \max }\nolimits_{a} Q\left( {s_{t + 1} , a;\theta_{t} } \right); \theta_{t}^{\prime} } \right) . $$
(5)
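The only change with respect to the DQN target is that the online parameters \( \theta \) select the action and the second set \( \theta^{\prime} \) evaluates it. A minimal sketch, under the same assumed q_network interface as above:

```python
import numpy as np

GAMMA = 0.9   # assumed discount factor

def double_dqn_targets(q_network, theta, theta_prime,
                       rewards, next_states, dones):
    """Y_t^DoubleQ = r + gamma * Q(s', argmax_a Q(s', a; theta); theta')."""
    a_star = q_network(next_states, theta).argmax(axis=1)   # action chosen with theta
    q_eval = q_network(next_states, theta_prime)            # value evaluated with theta'
    q_selected = q_eval[np.arange(len(a_star)), a_star]
    return rewards + GAMMA * (1.0 - dones) * q_selected
```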

2.5 Dueling DQN Algorithm

It is usually assumed that the choice of action always matters, but there are cases, especially in model-free problems, where selecting one action or another does not affect the overall performance of the agent [12]. Therefore, the action-value function \( Q\left( {s, a; \theta_{t} } \right) \) can be separated into its two main sub-functions:

$$ Q \left( {s,a; \theta , \alpha ,\beta } \right) = V\left( {s;\theta , \beta } \right) + A\left( {s,a; \theta , \alpha } \right) , $$
(6)

where \( V\left( s \right) \) is the value function, \( A\left( {s,a} \right) \) is the advantage function, \( \theta \) is the set of parameters shared by the neural network, and \( \beta \) and \( \alpha \) are the sets of parameters of the streams that estimate \( V\left( s \right) \) and \( A\left( {s,a} \right) \), respectively. The two streams are combined in the output layer in order to obtain the function \( Q \) [13].
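Equation (6) corresponds to an aggregation step that adds a scalar state value to a per-action advantage. The sketch below shows this aggregation for a batch of stream outputs; the optional mean-subtraction used in [13] for identifiability is included only as a commented variant, not as part of Eq. (6).

```python
import numpy as np

def dueling_aggregation(value, advantage, subtract_mean=False):
    """Combine the two streams of Eq. (6): Q(s, a) = V(s) + A(s, a).

    value:     array of shape (batch, 1), output of the V(s; theta, beta) stream
    advantage: array of shape (batch, n_actions), output of the A(s, a; theta, alpha) stream
    """
    if subtract_mean:
        # Identifiable variant from [13]: Q = V + (A - mean_a A)
        advantage = advantage - advantage.mean(axis=1, keepdims=True)
    return value + advantage
```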

3 Description of Problem and Applications of Algorithms

3.1 Medical Image Registration

Two images are considered during the registration process: the target image and the moving image. The moving image is deformed so that it becomes aligned with the coordinate system of the target image [14].

A geometric transformation is estimated considering three types of parameters: scale, orientation and translation. Monomodal and multimodal registration are performed in the 2D domain, with translation over the X and Y axes.
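For reference, such a transformation can be represented by a 2 × 3 similarity matrix built from the scale, angle and translation values. The sketch below constructs this matrix with NumPy; the function name and the choice of warping routine are illustrative assumptions, not taken from this work.

```python
import numpy as np

def similarity_matrix(scale, angle_deg, tx, ty):
    """2 x 3 affine matrix combining scale, rotation and translation;
    it can be passed to any image-warping routine."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty]])

# Example: the identity transform (no scaling, rotation or translation)
M0 = similarity_matrix(1.0, 0.0, 0, 0)
```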

3.2 Image Description

The image dataset is composed of brain magnetic resonance images (MRI) in T1 and T2 modalities and SPECT images (Fig. 1). The images were provided by the Radiologic Institute of Mar del Plata, Argentina. In these images, only rigid deformations can occur during acquisition.

Fig. 1. Brain magnetic resonance images used in the tests. G1: MRI of the same slice in T1- and T2-weighted modalities. G2: MRI of the same slice in T2-weighted and SPECT modalities.

T1 and T2 images are 256 × 256 pixels, 8-bit grayscale. SPECT images have the same size, in RGB color space with 8 bits per channel.

3.3 Software and Hardware Description

The RL algorithms were implemented in Python 3.6 using TensorFlow 1.10. The tests were performed on a Core i5 computer with 8 GB of RAM running a Debian operating system.

3.4 RL Algorithms in Image Registration

The RL algorithms have been adapted to image registration with the following considerations:

  1. Environment: composed of the target image and the moving image. The moving images are created using the parameters of Table 1 as targets, so that the error achieved by each algorithm can be measured.

     Table 1. Parameters used to generate the testbench for the RL algorithms.

  2. States: described by the set \( \left[ {s, \theta , t_{x} , t_{y} } \right] \), where \( s \) is the scale value, \( \theta \) is the orientation angle, and \( t_{x} \) and \( t_{y} \) are the pixel translations over the X and Y axes. The initial state is set to \( \left[ {1.0, 0^\circ , 0, 0} \right] \).

  3. Actions: the total number of actions is \( 2^{4} = 16 \), since there are \( 4 \) transformation parameters and \( 2 \) possibilities for each one (increment or decrement). The parameter steps are \( \left[ { \pm 0.05, \pm 0.25^\circ , \pm 1, \pm 1} \right] \); a sketch of this action encoding is given after this list.

  4. Reward: the Pearson correlation coefficient is used in monomodal registration and mutual information in multimodal registration. These similarity measures are maximized during the registration process. When an action produces a decrease in similarity, the agent receives a penalty of −0.1 and the algorithm restarts the learning process.
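As mentioned in the Actions item, the 16 actions can be enumerated as all sign combinations over the four parameter steps. The sketch below shows one possible encoding together with the Pearson-correlation reward for the monomodal case; the names and the omission of the image-warping step are illustrative simplifications, not the exact implementation used in this work.

```python
import itertools
import numpy as np

# Parameter steps for (scale, angle in degrees, tx, ty), as in Sect. 3.4
STEPS = np.array([0.05, 0.25, 1.0, 1.0])

# 2^4 = 16 actions: each action increments (+1) or decrements (-1) every parameter
ACTIONS = np.array(list(itertools.product([1, -1], repeat=4)))

def apply_action(state, action_index):
    """state = [scale, angle, tx, ty]; returns the updated state."""
    return state + ACTIONS[action_index] * STEPS

def pearson_reward(target_img, moved_img):
    """Similarity measure used as reward in the monomodal case."""
    x = target_img.astype(float).ravel()
    y = moved_img.astype(float).ravel()
    return float(np.corrcoef(x, y)[0, 1])
```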

3.5 Backup Memory Criterion

In DQN-type algorithms, the agent's experiences are stored in order to train the neural networks, using a batch technique [15]. In the backup memory proposed in this work, the best experiences are saved and are only replaced by experiences with better reward values, which prevents these good solutions from being lost.

At a random position of the batch, the current information is replaced by experiences taken from the backup memory. Therefore, the batch contains information from the experience database (good and bad experiences) and from the backup memory (best experiences). In this work, the backup memory is used with the Double DQN algorithm, and the results show a good performance compared with the other algorithms.
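A minimal sketch of how such a backup memory could be combined with an ordinary replay batch is given below. The data layout and the single-slot replacement are illustrative assumptions; the two points taken from the text are that stored experiences are overwritten only by experiences with better rewards, and that one randomly chosen position of each batch is filled from the backup memory.

```python
import random

class BackupMemory:
    """Keeps the best experiences seen so far (ranked by reward)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []          # list of (reward, transition) pairs

    def offer(self, reward, transition):
        if len(self.items) < self.capacity:
            self.items.append((reward, transition))
            return
        worst = min(range(len(self.items)), key=lambda i: self.items[i][0])
        if reward > self.items[worst][0]:      # replace only with a better reward
            self.items[worst] = (reward, transition)

    def sample(self):
        return random.choice(self.items)[1]

def build_batch(replay_buffer, backup, batch_size):
    """Ordinary batch from the experience database, with one random position
    replaced by a best experience from the backup memory."""
    batch = random.sample(replay_buffer, batch_size)
    if backup.items:
        batch[random.randrange(batch_size)] = backup.sample()
    return batch
```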

4 Results

The initial parameters are the same for all algorithms. The number of epochs is 200 and the learning process runs for 100 iterations, although it can end earlier if the agent receives a penalty. The tests are evaluated over 5 independent runs.

Figures 2 and 3 present the mean values of the Pearson coefficient for monomodal registration. In Fig. 3, the Double-back algorithm, which uses the backup memory criterion, achieves a better correlation factor than the other algorithms.

Fig. 2. Mean of the monomodal registration results obtained according to the parameters of Test 1.

Fig. 3. Mean of the monomodal registration results obtained according to the parameters of Test 2.

Figures 4 and 5 present the mean values of mutual information for multimodal registration. The Double-back algorithm shows a performance similar to that of the other deep-network algorithms.

Fig. 4. Mean of the multimodal registration results obtained according to the parameters of Test 1.

Fig. 5. Mean of the multimodal registration results obtained according to the parameters of Test 2.

Tables 2 and 3 present the performance of each algorithm in the monomodal and multimodal registration tests, in terms of the mean values of the similarity measures and of the error between the target and the obtained parameters.

Table 2. Performance of the RL algorithms achieved during monomodal registration.
Table 3. Performance of the RL algorithms achieved during multimodal registration.

Finally, the average times spent by each algorithm during the tests are presented in Table 4.

Table 4. Average time spent by each algorithm during tests.

5 Conclusions

In this work, the Reinforcement Learning paradigm is applied to medical image registration, and a performance comparison between four RL algorithms is presented. The registration process is an optimization problem in which a similarity measure is maximized. RL allows an agent to be trained to find the best registration parameters for medical images under rigid deformations.

In all cases, the algorithms gave successful results, achieving a significant improvement in the similarity between the target and the registered images. According to the results, Q-learning, Double DQN and Double DQN with backup memory show better performance than the other algorithms.

The mean error obtained by Double-back is reduced by around 37% in Test 1 and 35% in Test 2 compared with the highest error achieved by the algorithms. Q-learning and Double DQN reduce the mean error by around 24% in Test 1 and 19% in Test 2. Moreover, these algorithms require a reasonable training time, which is important given the available computational resources.

Future work will focus on analyzing the optimal sizes of the batch and backup memories to improve the training of the deep neural networks, which could increase their performance and maximize the reward values. The RL paradigm could also be applied to nonlinear registration.