1 Introduction

The great achievements in various image processing and speech processing tasks demonstrate the powerful function approximation ability of deep learning (DL) methods for high-dimensional data. Borrowing techniques from DL, a new subbranch of reinforcement learning (RL), termed deep reinforcement learning (DRL), demonstrates excellent performance in many complex decision-making tasks. Since the deep Q-network (DQN) [1] achieved human-level performance in a majority of Atari games, researchers have contributed many methods to the DRL approach, e.g., extending DRL to continuous action spaces [2], asynchronous training [3], robust exploration [4], and effective sampling [5]. DRL methods have achieved many successes in various fields, such as games [6, 7], robotics [8,9,10], natural language processing [11, 12], and healthcare [13,14,15].

DRL algorithms can train effective artificial intelligence (AI) agents, but the agents’ behavior becomes uninterpretable once a deep learning model is introduced. Humans cannot be at ease when AI agents execute important tasks without justification, particularly in fields such as autonomous driving and healthcare. For example, a patient has no idea how to choose an appropriate therapy if a healthcare agent merely recommends a list of therapies without providing any supporting reason. More seriously, black-box models are susceptible to adversarial attacks [16]. In addition, end-to-end trained DRL methods that lack interpretability are very difficult to debug and optimize. Equipping DRL agents with good interpretability will be a significant advancement before they are deployed in real-world scenarios. Accordingly, the interpretability of DRL methods is becoming increasingly important [17, 18]. Despite its importance, the interpretability of DRL methods has not received enough attention. The difficulties mainly derive from the poor interpretability of the deep neural networks (DNNs) applied in DRL models and from the fact that the interpretability of a model is difficult to measure.

As prior work on interpretable DRL, Greydanus et al. [19] borrowed saliency map techniques from the interpretable DL field to visualize the important parts of an agent’s observation and to interpret the relationship between the agent’s observation and its behavior. By generating saliency maps with perturbation techniques, they demonstrated the effectiveness of the methods in explaining an agent’s attention, the reason for its decision, and the evolving learning process. Puri et al. [20] proposed specific and relevant feature attribution (SARFA), which improved the precision of the saliency map techniques compared with Greydanus’s work. The saliency maps generated by the proposed attribution focus more on the action-related features, so they provide more accurate and focused interpretations. Although these methods can relate the observation data to the decision-making process, the resulting information lacks sufficient clarity and needs further explanation for nonprofessionals. Specifically, the saliency maps still provide no explicit interpretation of the relationship between the output action and the input observation data.

In this paper, we analyze the relationship between an agent’s actions and its observations from the perspective of information theory. A best policy exists under the condition of the limited information provided by the observation data, and the amount of information contained in an observation determines an upper bound on the agent’s performance. To provide explicit interpretability and causal factors for the deep model, we propose a conceptual embedding technique to enhance the interpretability of a DRL-based agent. We seek the conceptual factors that directly relate to certain actions and measure the importance of the conceptual factors that affect the agent’s decisions. Consequently, we can directly track the decision-driving factors behind the agent’s different behaviors. To retain the effective information contained in the observation, we analyze the observation information of the latent features with respect to the agent’s performance. Through a simple example, we also demonstrate how the information loss of the observation degrades an agent’s performance. The key contributions of this paper are summarized as follows:

  • Intermediate representations are introduced to address the problem of causality between action and observation.

  • A conceptual embedding method is proposed to produce interpretable representation spaces in the deep reinforcement learning model.

  • Based on information theory, a relationship between an agent’s performance upper bound and its observation information is identified.

The remainder of this paper is organized into five sections. In Section 2, we review research on interpretable DRL methods and summarize the main approaches. In Section 3, we analyze the information processing of DRL-based agents and propose our method to improve the interpretability of DRL agents: hierarchical conceptual embedding techniques are proposed, and the corresponding analyses are presented based on information theory. Experiments that validate the proposed method and the corresponding analyses are presented in Section 4. We discuss the findings in Section 5 and conclude the paper in Section 6.

2 Related work

In the early stage of DRL development, Zahavy et al. [21] applied the t-distributed stochastic neighbor embedding (t-SNE) technique to explain the behavior of DQN-based agents in Atari environments and presented an embedding-based approach for aggregating the representations of the state space. It is a good starting point for analyzing the behavior of DRL-based agents from the perspective of the observation space and state space, and the reduced state space is more easily related to an agent’s behavior. However, the authors only aggregated the state space and did not trace the influence of different features on an agent’s behavior. It is still necessary to identify an explicit relationship between an agent’s behavior and its observation.

With the development of interpretable DL, researchers have begun to borrow techniques from the interpretable DL field to interpret DRL-based agents. Saliency maps are the main adopted technique; they highlight the parts of the observation data that are important for an agent’s decisions. Generally, there are two main approaches: gradient-based and perturbation-based. The gradient-based approach identifies the sensitive features that greatly influence an agent’s decision by calculating gradients. As an efficient instance of this approach, Simonyan et al. [22] computed the Jacobian of the DNN output with respect to the input to extract class saliency maps. Assume that Si[O] is the score of class i with respect to observation O. The saliency map is calculated as the derivative of Si[O] with respect to a specific observation o, i.e., \(\frac {\partial S_{i}[O]}{\partial {o}}.\) Subsequently, improved variants have been proposed to generate more interpretable saliency maps, such as DeepLIFT [23] and Grad-CAM [24]. Although gradient-based methods are mathematically reasonable for computing saliency maps, they are sensitive to noise and often generate meaningless salient points. In particular, gradient-based methods cannot be applied to nondifferentiable agent models, and they cannot be used when the agent models are provided as packaged black boxes.
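To make the gradient-based computation concrete, the following minimal sketch (not taken from the cited works) uses PyTorch autograd to obtain \(\frac{\partial S_{i}[O]}{\partial o}\) for the highest-scoring action of a hypothetical Q-network; the network architecture and observation size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network; any differentiable agent model works for this sketch.
q_net = nn.Sequential(nn.Flatten(), nn.Linear(80 * 80, 256), nn.ReLU(), nn.Linear(256, 4))

obs = torch.rand(1, 80, 80, requires_grad=True)   # observation o
scores = q_net(obs)                               # one score S_i[O] per action
chosen = scores.argmax(dim=1)                     # action the agent would take

# Gradient-based saliency: dS_i[O]/do for the chosen action i (in the spirit of [22]).
scores[0, chosen].backward()
saliency_map = obs.grad.abs().squeeze(0)          # |gradient| marks the sensitive pixels
```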

The perturbation-based approach discovers the important features that determine an agent’s decision by perturbing certain original features and then observing the change in the agent’s decision. This approach is more intuitive and provides straightforward explanations. To interpret DL models, Fong and Vedaldi [25] proposed a perturbation framework to discover the class-related parts of an image for any black-box model. Recently, the perturbation-based approach has become popular for interpreting the behavior of a DRL-based agent. Greydanus et al. [19] employed a perturbation method based on Gaussian blur that produces saliency maps with respect to the actor network and critic network in the A3C architecture [3]. Let S[f] be the importance score or saliency score of the observation feature f with respect to an agent’s action a, and let \( {o}^{\prime }\) be the perturbed version of the original observation o with respect to feature f. S[f] can be calculated by measuring the difference in the agent’s policy π(⋅) between the two observations, i.e., \(S[f] = \|\pi ({ {o}}) - \pi ({ {o}^{\prime }})\|\), or the difference in the value function Vπ(⋅), i.e., \(S[f] = \|V_{\pi }( {o}) - V_{\pi }( {o}^{\prime })\|\). Based on the work of Greydanus et al., Puri et al. [20] designed an improved metric, termed SARFA, which filters out many irrelevant features and highlights only the key features with respect to a particular decision. However, perturbing high-dimensional observation data incurs a high computational cost, and the saliency maps are not explicitly related to the adopted actions.

The work of Iyer et al. [26] applies an idea similar to the premise of this work. The authors introduced a recognition process into the decision process of a DRL-based agent and produced saliency maps of conceptual objects rather than raw pixel features to analyze the decision process. Nevertheless, they still try to interpret the raw input data, and it is difficult to search for the objects and analyze the complicated relationships in high-dimensional data. In addition, certain studies have introduced causal models for DRL agents [27, 28]. However, causal models have to be defined in advance, and their fatal defect is that they cannot discover unknown causal factors from observations. An overview of the approaches is shown in Table 1. In this work, we borrow ideas from the perturbation-based approach and propose our method based on conceptual embedding techniques. We try to explain the causality between an agent’s action and its observation in a reduced representation space. To make the features more explicit and more interpretable for human beings, we propose building conceptual features into the hidden layers of DNN-based models and analyzing the effective information that the features contain for the agent’s policy. Furthermore, we can analyze the relationship between an agent’s performance upper bound and its observable information.

Table 1 Overview of interpretable DRL approaches

3 Conceptual embedding in DRL

For the concept of model interpretability, researchers may have their own definitions reflecting their different points of view. Nevertheless, it is safe to conclude that a model has good interpretability if we can track and detail its decision process. In this work, we aim to track the main decision process of DRL agents and to discover the salient features and interpretable factors that directly impact certain actions of DRL agents. First, we analyze the causality between an agent’s observation and its adopted action, and the intermediate representation space of the DRL model is analyzed based on information theory. Second, we propose using hierarchical conceptual embedding techniques to build the DNN-based architectures in DRL agents so that we can embed prior knowledge into the DNN architectures and constrain the representation spaces of DRL agents. Third, we propose generating saliency values for the conceptual embedding activations in the DNNs so that we can discover the important factors and track the decision process of DRL agents. Because the conceptual embedding techniques can be utilized with any DNN-based model, this is a general framework that can be applied in different DRL algorithms to embed conceptual factors and interpret the causal factors of deep models.

3.1 Preliminaries

RL [29] can be formulated as a Markov decision process (MDP) defined by a six-element tuple \(<\mathcal {S}, \mathcal {A}, \mathcal {T}, s_{0}, r, \gamma >,\) where \(\mathcal {S}\) is a set of states s, \(\mathcal {A}\) is a set of actions a, \(\mathcal {T}\) is a transition function \(s_{t+1} \sim p(s_{t+1}\mid s_{t}, a_{t})\) representing the state transition probabilities of the environment from time step t to t + 1, s0 is the initial state distribution, r is a reward function that generates the reward at each time step, \(r_{t+1} = r(s_{t}, a_{t}, s_{t+1})\), and γ ∈ [0,1] is the discount factor.

Generally, DRL refers to the RL methods that use deep neural networks (DNNs) to approximate the value function or/and policy function. DRL usually addresses the partially observable MDP (POMDP) problem [30], in which the agent cannot directly observe the true latent state of the environment. The agent can only receive high-dimensional observable data o generated by the latent state s of the environment and then infer the latent state, expressed as \(p(s\mid {o})\). The framework formalization can be extended to an expanded tuple \(<\mathcal {S}, \mathcal {O}, \mathcal {A}, \mathcal {T}, \mathcal {G}, s_{0}, r, \gamma >\), where \(\mathcal {O}\) is the set of observations o, and \(\mathcal {G}\) is the observation generation function that specifies the probability of observation ot given a certain state st, expressed as \( {o}_{t} \sim p( {o}_{t}\mid s_{t})\). An extension of the POMDP is the partially observable stochastic game (POSG) [31, 32], which is employed for multiagent systems. This work focuses on a single agent, so we will not further discuss the POSG.

The DRL framework of an agent is illustrated in Fig. 1.

Fig. 1

Illustration of the DRL framework of an agent with a POMDP environment. The objective of the agent is to learn a DNN-based policy \(\pi_{\theta}\) to maximize its return, i.e., \(\max \limits _{\pi _{\theta }} {\sum }_{k=0}^{\infty } \gamma ^{k} r_{t+k}\). The environment has a latent state transition function, which is represented as a conditional probability function \(p(s_{t+1}\mid s_{t}, a_{t})\). The latent state \(s_{t+1}\) produces an observation \( {o}_{t+1}\) by \(p( {o}_{t+1}\mid s_{t+1})\)

The agent, which is a module that receives the observation of the environment \(o_{t}\) with the reward signal \(r_{t}\) and outputs an action \(a_{t}\), repeatedly interacts with the environment. The POMDP environment is influenced by the agent’s action \(a_{t}\) and changes its latent state to \(s_{t+1}\), which corresponds to observation \(o_{t+1}\) and reward signal \(r_{t+1}\). The objective of the agent is to learn an optimal policy \(\pi_{\theta}(a\mid o)\) to gain the maximum future return \(\mathcal {R}\), written as

$$ \mathcal{R} = {\sum}_{k=0}^{\infty}{\gamma^{k}r_{t+k}}, $$
(1)

where γ ∈ [0,1] is the discount factor. To judge the expected future return when an agent adopts a given policy in a certain state, we can use the state value function Vπ(s) with respect to policy π, expressed as

$$ V_{\pi}(s)=\mathbb{E}_{\pi}[{\sum}_{k=0}^{\infty}{\gamma^{k} r_{t+k}\mid s_{t}=s}], $$
(2)

where \(\mathbb {E}_{\pi }\) denotes the expectation under policy π and Vπ(s) is used to estimate the expected return when an agent enters a certain state s. To judge the expected future return when an agent adopts a certain action in a certain state and conducts a given policy in the following step, we can use the state-action value function Qπ(s,a) with respect to policy π, expressed as

$$ Q_{\pi}(s, a)=\mathbb{E}_{\pi}[{\sum}_{k=0}^{\infty}{\gamma^{k} r_{t+k}\mid s_{t}=s,a_{t}=a}], $$
(3)

where Qπ(s,a) is used to estimate the expected return if an agent adopts a certain action a in a certain state s. With regard to DRL methods, DNNs are typically deployed to construct and learn the complicated policy function \(\pi_{\theta}(a\mid o)\). For instance, value-based DRL methods such as DQN [1] employ DNNs to fit the state value function Vπ(s) or the state-action value function Qπ(s,a) to guide the policy, and policy-based DRL methods such as TRPO [4] directly apply DNNs to learn the policy function \(\pi_{\theta}(a\mid o)\) that maps an observation o to an action a. There are also combined methods, such as DDPG [2], A3C [3], and SAC [5]; these are termed actor-critic algorithms, whose policy optimization is guided by a learned value function. Although DNNs provide great function approximation ability for DRL methods, they are also a main source of poor model interpretability: end-to-end trained DNN-based models only consider the results, and the reasons why the results are produced are disregarded.
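As a rough illustration of the two families (the layer sizes and class names are our own assumptions, not the architectures of the cited algorithms), a value-based approximator outputs one Q-value per action, whereas a policy-based approximator outputs the action distribution \(\pi_{\theta}(a\mid o)\) directly:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Value-based DRL: maps an observation to one Q-value per action (DQN-style)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, obs):
        return self.net(obs)                  # Q(o, a) for every a; greedy action = argmax

class PolicyNetwork(nn.Module):
    """Policy-based DRL: maps an observation directly to a distribution pi_theta(a | o)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, obs):
        return self.net(obs).softmax(dim=-1)  # action probabilities
```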

3.2 Causality between action and observation

For the behavior of an agent, there is a strong causality between the action and the observation: an agent adopts an action a according to its observation o, and different observations produce different actions. In particular, assume a well-trained agent with a fixed policy \(\pi_{\theta}(a\mid o)\) such that

$$ p(A=a\mid O={o}) = 1. $$
(4)

Equation (4) means that an observation o will determine an action a of the agent. Therefore, we can also state that the observation is the cause and that the action is the effect. The causality can be denoted as

$$ {o} \rightarrow a. $$
(5)

The observation spaces of DRL problems, such as images, are usually high-dimensional. Current methods try to discover the relationship between actions and observations, but it is difficult to directly obtain the causality between a raw high-dimensional observation and an action. Even though these methods can identify the important parts of the observation, it is still difficult to provide explicit reasons for an agent adopting a certain action. In the following subsection, we introduce an intermediate representation to construct the causal relationship and analyze the process based on information theory.

3.2.1 Intermediate representation embedding

For human beings, high-dimensional sensory signals will form abstract concepts in the brain’s cognitive system, and decisions will be made by reasoning according to the concepts. It is reasonable to map the high-dimensional observation space into a compact representation space for decision-making. The causes in a compact representation space are easier to interpret than in a raw high-dimensional observation space. Therefore, we propose an embedding causality discovery approach for interpreting the behavior of a DRL agent. Instead of directly establishing the causality between an action a and an observation o, we introduce an intermediate representation v, which is a mapping representation from the observation space. We just need to determine the causality between the intermediate representation v and the action a. We obtain v from observation o, denoted as

$$ v = \phi({o}). $$
(6)

The mapping function ϕ(⋅) might be a fixed nonlinear transformation that can map a certain observation or similar observations o into a certain representation v. The mapping function can be a one-to-one mapping or a many-to-one mapping, but not a one-to-many mapping. The representation v is determined by a certain observation o, formulated as

$$ p(V=v\mid O={o}) = 1. $$
(7)

Therefore, the process can be denoted as

$$ {o} \mapsto v \rightarrow a. $$
(8)

The observation o maps to (denoted as ↦) a representation v, and then we use v to deduce action a. We only need to discover the causality between the intermediate representation v and the action a. Therefore, we can transform a high-dimensional observation space into a compressed representation space with good interpretability, in which the main causes of an agent’s behavior are easy to obtain. However, this transformation prompts a question: will the compressed representation produce information loss and cause a decline in the agent’s performance? To answer this question, we analyze the process based on information theory.

3.2.2 Information loss of observation space reduction

The observation space transformation may cause information loss for action decision-making. For instance, action selection will increase uncertainty when the transformation loses certain observation information. Formally, assume that given an observation o, the original action selection conditional probability distribution is

$$ a \sim p(A\mid O={o}). $$
(9)

The uncertainty of the action selection can be measured by conditional entropy as

$$ \begin{aligned} \mathcal{H}(A\mid O) &= {\sum}_{{o}}p({o})\mathcal{H}(A\mid O={o}) \\ &= -{\sum}_{{o}}p({o}){\sum}_{a}p(a\mid {o})log_{2}{p(a\mid {o})} \\ &= -{\sum}_{{o},a}p(a,{o})log_{2}{p(a\mid {o})}. \end{aligned} $$
(10)

The conditional entropy \({\mathscr{H}}(A\mid O)\) indicates how much information the observation can provide for an agent’s decision. A larger \({\mathscr{H}}(A\mid O)\) means a smaller amount of observation information, and vice versa. After we introduce the intermediate representation v for action decision-making, the action selection conditional probability distribution becomes

$$ a \sim p(A\mid V=v). $$
(11)

The uncertainty of the action selection on the basis of the intermediate representation will be

$$ \mathcal{H}(A\mid V) = {\sum}_{v}p(v)\mathcal{H}(A\mid V=v). $$
(12)

The conditional entropy \({\mathscr{H}}(A\mid V)\) indicates how much information the intermediate representation can provide for an agent’s decision. A larger \({\mathscr{H}}(A\mid V)\) means a smaller amount of information, and vice versa. We assume a rational agent that makes the best decision estimate according to its observation: the more information the agent observes, the more certain its behavior. The information loss of the transformation from observation space O to intermediate representation space V can be defined by the difference between (12) and (10).

Definition 1 (Observation Information Loss)

The observation information loss from observation space O to intermediate representation space V can be measured by the increased uncertainty in an agent’s behavior according to (10) and (12), as

$$ \mathcal{L}_{V} = \mathcal{H}(A\mid V)-\mathcal{H}(A\mid O). $$
(13)

A larger \({\mathscr{L}}_{V}\) means a larger amount of information loss when the observation space O maps to intermediate representation space V. A good intermediate representation space V should have low observation information loss. Hence, (13) can be a metric for optimizing the mapping function (6).

Intuitively, the observation information loss will reduce the performance of the agent. It is necessary to minimize \({\mathscr{L}}_{V}\) in the agent training phase, such as introducing an additional loss term to the objective function.

There is also the problem of how to estimate the probability distributions \(p(a\mid {o})\) and p(o). It may be impossible to know the precise latent distributions of the environment. However, one approximation is to count the proportions of historical observations o and observation-action pairs (o,a). The proportions of representations v and representation-action pairs (v,a) can be counted from the same experience because each observation o maps to a corresponding representation v.
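The counting procedure can be sketched as follows: empirical estimates of \(\mathcal{H}(A\mid O)\) and \(\mathcal{H}(A\mid V)\) are obtained from logged (o, a) pairs and a mapping ϕ, and their difference gives the observation information loss of (13). The logged data and the mapping below are hypothetical placeholders.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """Estimate H(A | X) from (x, a) samples by counting proportions."""
    joint, marginal, n = Counter(pairs), Counter(x for x, _ in pairs), len(pairs)
    return -sum(c / n * math.log2(c / marginal[x]) for (x, _), c in joint.items())

# Hypothetical logged experience and mapping phi (here: keep only the first dimension).
observations = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 100
actions = [0, 0, 0, 1] * 100
phi = lambda o: o[0]

h_a_given_o = conditional_entropy(list(zip(observations, actions)))
h_a_given_v = conditional_entropy([(phi(o), a) for o, a in zip(observations, actions)])
info_loss = h_a_given_v - h_a_given_o        # Eq. (13): L_V = H(A|V) - H(A|O)
print(h_a_given_o, h_a_given_v, info_loss)   # 0.0, 0.5, 0.5 for this toy log
```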

3.3 Hierarchical conceptual embedding

Due to increasingly high-dimensional state and action spaces, DRL agents have to explore exponentially growing spaces and face the curse of dimensionality. A hierarchical learning architecture is one way to address this problem: it divides a large problem into small subproblems and conquers each subproblem to solve the high-level problem [33]. The hierarchical learning architecture partially decomposes the complicated problem and details the multilevel learning process from easy to difficult levels, which makes the model easier to interpret. To explicitly expose interpretable factors aligned with human concepts, the conceptual embedding technique [34] can be utilized to embed prior knowledge into DNN models. We propose combining a hierarchical learning architecture with conceptual embedding techniques to improve the interpretability of DRL agents, as illustrated in Fig. 2.

Fig. 2

Illustration of the hierarchical conceptual embedding architecture for policy \(\pi_{\theta}(a\mid o)\) approximation. FN means free neuron, and CN means conceptual neuron [34]

The DNN-based policy model of the DRL agent produces actions according to observations from the environment. Generally, the model is a black box for humans, and we are concerned with only its input and output. The representation spaces of the hidden layers are often meaningless and too complicated to be understood in terms of our concepts. Interpreting all the variables of the hidden neurons surpasses human cognitive ability. Nonetheless, we can capture the important factors and reduce them to a human-interpretable representation space.

The DNN-based model can be regarded as a complex function that maps an observation o to an action a through multilayer nonlinear transformation,

$$ a = f({o}). $$
(14)

We can backtrack the DNN architecture from the output action a layer by layer. The output a is computed by a subfunction that takes the previous layer’s computational result (x1,c1) as input,

$$ a = f_{1}(x_{1}, c_{1}), $$
(15)

where x1 and c1 denote the activation values of the FNs and CNs, respectively, in macro-policy representation layer 1, and f1(⋅) is a nonlinear transformation function. The activation values c1 of the CNs are meaningful representations related to human concepts. Similarly, we can continue to backtrack to the next layer,

$$ (x_{1}, c_{1}) = f_{2}(x_{2}, c_{2}), $$
(16)

where x2 and c2 denote the activation values of FNs and CNs respectively in macro-policy representation layer 2, and f2(⋅) is another nonlinear transformation function.

We can continue to backtrack to the input layer in a similar way. However, the shallow layers near the input layer are often employed to extract shallow features, such as edge information in images. We are mainly concerned with the deep layers near the output layer, which hold abstract high-level representations, such as macro strategy targets.
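A minimal sketch of such an architecture, assuming a policy network whose bottleneck layer is split into free neurons (FNs) and conceptual neurons (CNs) as in (15) and (16); the layer sizes, the class name ConceptualPolicy, and the FN/CN split are illustrative assumptions rather than the exact architecture of Fig. 2.

```python
import torch.nn as nn

class ConceptualPolicy(nn.Module):
    """Policy a = f(o) whose bottleneck exposes free (x) and conceptual (c) activations."""
    def __init__(self, obs_dim=6400, n_actions=4, n_fn=1, n_cn=2):
        super().__init__()
        self.n_cn = n_cn
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_fn + n_cn))   # bottleneck = FNs + CNs
        self.decision = nn.Sequential(nn.Linear(n_fn + n_cn, 64), nn.ReLU(),
                                      nn.Linear(64, n_actions))

    def forward(self, obs):
        z = self.encoder(obs)
        x, c = z[:, :-self.n_cn], z[:, -self.n_cn:]   # split into FN and CN activations
        out = self.decision(z)                        # a = f_1(x_1, c_1), Eq. (15)
        return out, c                                 # expose c for alignment with concepts
```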

When an event is interpretable, it usually means that we can ascertain the reasons and factors related to the event. For an agent, if we can grasp the factors that caused it to adopt a certain action, we can consider the agent’s behavior interpretable. Therefore, to interpret the agent’s behavior, we can discover latent variables related to the adopted actions. The Pearson correlation coefficient (PCC) measures the correlation between two variables. In our case, we use the PCC to measure the relationship between the activation of a conceptual neuron c and the corresponding probability of a specific action, \(p_{a} = p(a\mid c)\), as

$$ \rho(c, p_{a}) = \frac{{\sum}_{c}(c-\overline{c})(p_{a}-\overline{p_{a}})} {\sqrt{{\sum}_{c}(c-\overline{c})^{2}}\sqrt{{\sum}_{c}(p_{a}-\overline{p_{a}})^{2}}}, $$
(17)

where \(\overline {c}\) is the mean of c, \(\overline {p_{a}}\) is the mean of pa, and ρ(c,pa) ∈ [− 1,1]. When ρ(c,pa) > 0, there is a positive correlation between c and \(p(a\mid c)\); the positive correlation becomes stronger as the value increases. When ρ(c,pa) < 0, there is a negative correlation between c and \(p(a\mid c)\); the negative correlation becomes stronger as the absolute value increases. The case ρ(c,pa) = 0 also exists, which makes it difficult to interpret the relationship between c and \(p(a\mid c)\). We hope that the latent embedding c has a strong correlation with the probability pa of action a, so that the value of c influences the probability of the adopted action a. Therefore, we can add an extra optimization objective to the original training objective, expressed as

$$ \max \mid {\sum}_{c}\rho(c, p_{a})\mid . $$
(18)
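A minimal sketch of estimating (17) and the regularizer (18) from logged rollouts; the paired samples of the CN activation c and the action probability pa below are hypothetical data.

```python
import numpy as np

# Hypothetical logged data: a CN activation c and the probability p_a = p(a|c) of one action.
c_values = np.array([0.1, 0.4, 0.5, 0.9, 0.2, 0.8])
p_a = np.array([0.05, 0.35, 0.55, 0.95, 0.15, 0.85])

def pcc(c, p):
    """Pearson correlation coefficient rho(c, p_a), Eq. (17)."""
    cc, pc = c - c.mean(), p - p.mean()
    return (cc * pc).sum() / (np.sqrt((cc ** 2).sum()) * np.sqrt((pc ** 2).sum()))

rho = pcc(c_values, p_a)
extra_loss = -abs(rho)   # maximizing |rho| in Eq. (18) = adding -|rho| to the training loss
print(rho, extra_loss)
```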

3.4 Perturbation technique for conceptual embedding

Saliency maps are usually generated by perturbing the input layer and observing the changes in the output layer. In our case, we apply saliency map techniques to the hidden layers. As we embed certain interpretable prior knowledge into the hidden layers, we can also generate saliency maps for the conceptual embedding factors using the related methods.

Typically, the saliency of specific features can be measured by policy difference as (19), by the value difference as (20), or by the Q-value difference as (21) between original observation o and perturbed observation \( {o}^{\prime }\) with respect to feature f.

$$ S[f] = {\|\pi({{o}}) - \pi({{o}^{\prime}})\|}. $$
(19)
$$ S[f] = {\|V_{\pi}({o}) - V_{\pi}({o}^{\prime})\|}. $$
(20)
$$ S[f] = {\|Q_{\pi}({o},a) - Q_{\pi}({o}^{\prime},a)\|}, $$
(21)

where S[f] is the saliency value of feature f, π(⋅) is the policy function that outputs an action according to a given observation, Vπ(⋅) is the observation value function that judges the state value according to an observation, and Qπ(⋅,⋅) is the observation-action value function that judges the state-action value according to an observation and an action. ∥⋅∥ denotes the norm function, typically the ℓ2 norm, used to calculate the distance between two vectors.

Equations (19) and (20) presented in [19] judge the summarized difference over all actions rather than specific actions or factors. Therefore, they cannot highlight saliency maps for a specific factor. Iyer et al. [26] adopted the Q-value difference shown in (21), which is somewhat specific to the actions. However, they still could not provide distinct and sufficient interpretability for the DRL agents and disregarded the exclusive effects on certain actions.

We propose to generate saliency values for the interpretable CNs in the hidden layers to provide deeper insight into the behaviors of DRL agents and to track the factors that influence the decisions. For the CN activation c, we can also measure the difference between the original activation and the perturbed activation with respect to feature f as

$$ S_{c}[f] = \|f_{c}({o}) - f_{c}({o}^{\prime})\|, $$
(22)

where fc(⋅) denotes a function, which is a part of the feedforward computation of the DNN model, outputting activation c. For more general cases, the measured activation c can also be a CN or a group of CNs.

According to (22), we can generate the saliency values for all the CNs and actions. As a result, we can backtrack the salient decision factors from the action output layer.

On the other hand, we can also regard the CNs as input for the action output. Hence, we can also judge the influence of each CN activation c based on the final action decisions, represented as

$$ S_{a}[c] = \|f_{a}(c) - f_{a}(c^{\prime})\|, $$
(23)

where fa(⋅) denotes the transformation function to the action output layer, and \(c^{\prime }\) denotes the perturbed c.
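The two perturbation directions of (22) and (23) can be sketched as follows, assuming a model shaped like the hypothetical ConceptualPolicy from Section 3.3 (with encoder, decision, and n_cn attributes); the Gaussian perturbation scale is an assumption.

```python
import torch

@torch.no_grad()
def saliency_of_feature_on_cn(model, obs, perturb_fn):
    """Eq. (22): change of the CN activations c when observation feature f is perturbed."""
    _, c = model(obs)
    _, c_perturbed = model(perturb_fn(obs))      # o' = observation with feature f perturbed
    return (c - c_perturbed).norm(dim=-1)        # S_c[f]

@torch.no_grad()
def saliency_of_cn_on_action(model, obs, cn_index, noise_std=0.1):
    """Eq. (23): change of the action distribution when one CN activation is perturbed."""
    z = model.encoder(obs)
    z_perturbed = z.clone()
    col = z.shape[1] - model.n_cn + cn_index     # column of the chosen CN in the bottleneck
    z_perturbed[:, col] += noise_std * torch.randn(z.shape[0])
    pi = model.decision(z).softmax(dim=-1)
    pi_perturbed = model.decision(z_perturbed).softmax(dim=-1)
    return (pi - pi_perturbed).norm(dim=-1)      # S_a[c]
```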

3.5 Measurements of model performance

To improve an agent’s performance and interpretability, we must quantify them. Informally, an agent can achieve good performance when it receives enough information from the environment, and a model has good interpretability when we can identify the key factors that drive the agent to adopt a certain action.

3.5.1 Causality completeness

With regard to an agent, its observation determines its action. To interpret the DRL model, we discover the key causal factors contained in the observation that influence the agent’s action. We can measure the model interpretability by whether the conceptual neurons can provide all the reasons for a certain action, that is, whether the set of conceptual neurons provides enough information to determine the agent’s action. Hence, our task is to measure the information completeness of the conceptual neurons, i.e., the interpretable factors, with respect to the agent’s observation.

Without any observation or prior knowledge, we estimate that all actions have the same probability according to the principle of maximum entropy. Assuming that the number of available actions is m, the uncertainty of an agent’s behavior A can be measured by the information entropy

$$ \mathcal{H}[p(A)] = -{\sum}_{a}p(a)log_{2}{p(a)} = log_{2}{m}. $$
(24)

Definition 2 (Complete Causal Factor Set)

If a set of causal factors \(V = \{V_{1}, \dots , V_{n}\}\) completely determines an agent’s action A, the set is a complete causal factor set. Formally, given the specific causal factor \(v = \{v_{1}, \dots , v_{n}\}\), the best estimated action of the agent can be determined as

$$ p(A=a\mid v) = 1. $$
(25)

The behavior uncertainty can be defined by conditional entropy as

$$ \mathcal{H}[p(A\mid V)] = -{\sum}_{v}p(v){\sum}_{a}{p(a\mid v)log_{2}{p(a\mid v)}}= 0. $$
(26)

The causal factor set V is a complete causal factor set for an agent’s behavior A.

For general cases, we can define the degree to which a causal factor set can impact an agent’s behavior.

Definition 3 (Causal Factor Set Completeness)

Causal factor set completeness can be defined by information entropy as

$$ N = \frac{\mathcal{H}[p(A)] - \mathcal{H}[p(A\mid V)]}{\mathcal{H}[p(A)]}, $$
(27)

where N ∈ [0,1].

Similar to Definition 3, we can define the causal factor set completeness with respect to conceptual causal factor C.

Definition 4 (Conceptual Causal Factor Set Completeness)

Conceptual causal factor set completeness can be defined as

$$ N_{C} = \frac{\mathcal{H}[p(A)] - \mathcal{H}[p(A\mid C)]}{\mathcal{H}[p(A)]}. $$
(28)

Conceptual causal factor set completeness describes the degree to which the conceptual factor set C impacts the agent’s action. We can use NC to measure the interpretability of an agent’s behavior; a larger NC means better interpretability. NC = 0 means that the agent’s behavior is totally unpredictable in terms of the conceptual causal factors and that we cannot interpret the agent’s behavior. When NC = 1, the agent’s behavior is fully determined by the conceptual causal factors, and we can interpret all the agent’s behaviors through the interpretable causal factors.
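A minimal sketch of estimating the completeness metrics of (27) and (28) from counted factor-action pairs; the logged factors and actions below are hypothetical.

```python
import math
from collections import Counter

def entropy(actions):
    """Empirical H[p(A)] from action samples; equals Eq. (24) for a uniform distribution."""
    counts, n = Counter(actions), len(actions)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """Empirical H[p(A|V)] from (factor, action) samples."""
    joint, marginal, n = Counter(pairs), Counter(v for v, _ in pairs), len(pairs)
    return -sum(c / n * math.log2(c / marginal[v]) for (v, _), c in joint.items())

def completeness(factors, actions):
    """N = (H[p(A)] - H[p(A|V)]) / H[p(A)], Eqs. (27) and (28)."""
    h_a = entropy(actions)
    return (h_a - conditional_entropy(list(zip(factors, actions)))) / h_a

# Hypothetical rollout: the factor only partially determines the action, so N < 1.
actions = [0, 0, 0, 1]
factors = [0, 0, 1, 1]
print(completeness(factors, actions))   # about 0.38
```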

Definition 5 (Complementary Causal Factor Set)

If the conceptual causal factor set C combined with another factor set X can determine the action, i.e.,

$$\mathcal{H}[p(A\mid C,X)] = 0,$$

then the factor set X is a complementary free causal factor set of conceptual factor set C. The causal factor set completeness of the combination (C,X) is

$$N_{C,X} = \frac{\mathcal{H}[p(A)] - \mathcal{H}[p(A\mid C,X)]}{\mathcal{H}[p(A)]} = 1.$$

3.5.2 Observation information and agent performance

Does a causal factor set contain the full observation information for decision-making? How much information will the intermediate causal representations lose, and to what degree will the model performance be reduced? With these questions, we need to analyze the relationship between the observation information and the agent’s performance.

Definition 6 (Complete Observation Information)

Assume that the best estimation of the action distribution with respect to an observation o is \(\hat {p}(a\mid {o})\). If a set of causal factors v = ϕ(o) has the same optimal estimation \(\hat {p}(a\mid v) = \hat {p}(a\mid {o})\), v contains the complete observation information.

Lemma 1

Let \(v_{i} \in V\) be the reduced mapping of a subset of observations \(\{ {o}\}_{i} \subseteq O\), denoted by vi = ϕ({o}i). For any set \(\{ {o}\}_{i} \subseteq O\), if and only if all the observations o ∈{o}i have the same optimal estimation \(\hat {p}(a\mid {o} \in \{ {o}\}_{i})\), all vi in the representation space V contain complete observation information.

Proof

For any representation vi = ϕ(o), in which o belongs to a subset \(\{ {o}\}_{i} \subseteq O\), the optimal estimation of the action distribution with respect to vi will be the mean of \(\hat {p}(a\mid {o} \in \{ {o}\}_{i})\), written as

$$ \hat{p}(a\mid v_{i}) = \frac{1}{n_{i}}{\sum}_{{o} \in \{{o}\}_{i}}\hat{p}(a\mid {o}), $$
(29)

where ni is the number of observations in {o}i. If all the observations o ∈{o}i have the same optimal estimation \(\hat {p}(a\mid {o} \in \{ {o}\}_{i})\), then vi has the optimal estimation

$$ \hat{p}(a\mid v_{i}) = \hat{p}(a\mid {o}), $$
(30)

i.e., the representation vi has the same optimal estimation as its original observation o. Hence, all vi in the representation space V contain complete observation information. Otherwise, if two observations oa,ob ∈{o}i have different optimal estimations, i.e., \(\hat {p}(a\mid {o}_{a}) \neq \hat {p}(a\mid {o}_{b})\), then no optimal estimation exists for vi that satisfies all observations in subset {o}i, i.e., no single estimate can satisfy

$$ \hat{p}(a\mid v_{i}) = \hat{p}(a\mid {o}_{a}) = \hat{p}(a\mid {o}_{b}). $$
(31)

The representation vi loses certain information that will degrade the agent performance. □

Definition 7 (Optimal Policy Offset)

Assume that the best estimation of the action distribution with respect to an observation o is \(\hat {p}(a\mid {o})\). If a set of intermediate representations v = ϕ(o) produces a different best estimation \(\hat {p}(a\mid v) \neq \hat {p}(a\mid {o})\), the policy based on v is changed. The optimal policy offset \(\mathcal {D}_{v,o}\) in terms of v can be measured by the KL-divergence as

$$ \mathcal{D}_{v,o} = {\sum}_{a} \hat{p}(a\mid v)log_{2}\frac{\hat{p}(a\mid v)}{\hat{p}(a\mid {o})}. $$
(32)

Intuitively, the observation information can promote the estimated performance of an agent.

Definition 8 (Agent Performance Upper Bound)

In a certain task environment, an agent will have a performance upper bound \(\sup (\overline {\mathcal {R}})\) given the limited observation space \(\mathcal {O}\).

$$ \sup(\overline{\mathcal{R}}) = \max_{\pi(a\mid {o})} \mathbb{E}_{\pi(a\mid {o})}[{\sum}_{k=0}^{\infty}{\gamma^{k} r_{t+k}}] $$
(33)

After the representation transformation from \(\mathcal {O}\) to V, there might be a reduction in the agent performance upper bound, represented as

$$ {\Delta} \overline{\mathcal{R}} = \sup(\overline{\mathcal{R}}_{O}) - \sup(\overline{\mathcal{R}}_{V}), $$
(34)

where \(\sup (\overline {\mathcal {R}}_{O})\) denotes the agent performance upper bound in the condition of the original observation space, and \(\sup (\overline {\mathcal {R}}_{V})\) denotes the agent performance upper bound in the condition of the transformed representation space.

Theorem 1

For any subset \(\{ {o}\}_{i} \subseteq O\) mapping to a representation vi in the representation space V = ϕ(O), all the observations o ∈{o}i should have the same optimal estimation \(\hat {p}(a\mid {o} \in \{ {o}\}_{i})\) to guarantee no reduction of the agent performance upper bound, i.e.,

$$ {\Delta} \overline{\mathcal{R}} = 0. $$

Proof

According to Lemma 1, if and only if all the observations o ∈{o}i have the same optimal estimation \(\hat {p}(a\mid {o} \in \{ {o}\}_{i})\), an agent can achieve the same optimal policy estimation \(\hat {p}(a\mid v_{i}) = \hat {p}(a\mid {o} \in \{ {o}\}_{i})\). According to (32), the optimal policy offset \(\mathcal {D}_{v_{i}, {o} \in \{ {o}\}_{i}}=0\). For any observation o that belongs to any subset {o}iO, the agent can still achieve optimal policy estimation after mapping to representation space V. The expected returns can remain the same, i.e.,

$$ \mathbb{E}_{\pi(a\mid v)}[{\sum}_{k=0}^{\infty}{\gamma^{k} r_{t+k}}] = \mathbb{E}_{\pi(a\mid {o})}[{\sum}_{k=0}^{\infty}{\gamma^{k} r_{t+k}}]. $$

Hence, \(\sup (\overline {\mathcal {R}}_{V}) = \sup (\overline {\mathcal {R}}_{O})\). According to (34), we can obtain \({\Delta } \overline {\mathcal {R}} = 0\). □

Theorem 1 states that improving the interpretability of the DRL model does not necessarily mean sacrificing model performance. In theory, an agent can achieve the performance upper bound if the interpretable representation space contains complete observation information. In practice, however, an agent often cannot learn the best policy because of the information loss in the representation space transformation and the instability of DRL algorithms.

4 Experiments

First, we design a simple environment in which the information loss and an agent’s performance upper bound can be calculated exactly, to clearly demonstrate the analysis process. Second, we choose a complex environment whose latent state transition function is unknown to validate the effectiveness of the proposed method and the analyses.

4.1 Computational analyses in a naive environment

4.1.1 Environment setup

Assume a naive experimental environment in which a machine (environment) provides a box to a monkey (agent) at each time step. There are three lights (red, green, and blue) on the box; the monkey observes the lights and then chooses whether to open the box. When the red and green lights are on, there is a banana in the box, and the monkey can open the box and retrieve the banana (obtain a reward). In the other cases, the monkey receives an electric shock (obtains a punishment) if it opens the box. The machine is defined as a tuple \(<\mathcal {S}, \mathcal {O}, \mathcal {A}, \mathcal {T}, \mathcal {G}, s_{0}, r, \gamma >\):

  • \(\mathcal {S}: \{0, 1, 2, 3\}\)

  • \(\mathcal {O}: \{(0,0,0),(0,1,1),(1,0,1),(1,1,0)\}\)

  • \(\mathcal {A}: \{0, 1\}\)

  • \(\mathcal {T}: \left \{\begin {array}{ll} p(s_{t+1} = s_{t} \mid s_{t}\neq 3, a_{t} = 1) = 1 \\ p(s_{t+1} = \xi \mid s_{t}=3, a_{t} = 1) = 0.25\\ p(s_{t+1} = \xi \mid s_{t}, a_{t} = 0) = 0.25\\ \xi \in \{0,1,2,3\} \end {array}\right .\)

  • \(\mathcal {G}: \left \{\begin {array}{ll} p(o=(0,0,0)\mid s=0) = 1 \\ p(o=(0,1,1)\mid s=1) = 1 \\ p(o=(1,0,1)\mid s=2) = 1 \\ p(o=(1,1,0)\mid s=3) = 1 \end {array}\right .\)

  • \(s_{0}: \left \{\begin {array}{ll} p(s=0) = 0.25 \\ p(s=1) = 0.25 \\ p(s=2) = 0.25 \\ p(s=3) = 0.25 \end {array}\right .\)

  • \(r: \left \{\begin {array}{ll} r(s_{t}\neq 3, a_{t}=1, s_{t+1}) = -1 \\ r(s_{t}=3, a_{t}=1, s_{t+1}) = 1 \\ r(s_{t}, a_{t}=0, s_{t+1}) = 0 \end {array}\right .\)

  • γ : 0.9

Based on the naive example, it is easy to demonstrate the relationship between the observation information and the performance upper bound of an agent.
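For reference, a minimal Gym-style sketch of the machine/monkey environment defined above; the rewards, transitions, and observation table follow the tuple specification, while the class and method names are our own.

```python
import random

class MonkeyBoxEnv:
    """Naive POMDP of Section 4.1.1: open the box (a=1) only when red and green are lit."""
    OBS = {0: (0, 0, 0), 1: (0, 1, 1), 2: (1, 0, 1), 3: (1, 1, 0)}

    def __init__(self):
        self.state = random.randint(0, 3)         # s_0 is uniform over {0, 1, 2, 3}

    def reset(self):
        self.state = random.randint(0, 3)
        return self.OBS[self.state]

    def step(self, action):
        if action == 1 and self.state == 3:
            reward = 1                             # banana: opened the rewarding box
            self.state = random.randint(0, 3)      # p(s' = xi | s = 3, a = 1) = 0.25
        elif action == 1:
            reward = -1                            # electric shock; the state is unchanged
        else:
            reward = 0                             # keep the box closed
            self.state = random.randint(0, 3)      # p(s' = xi | s, a = 0) = 0.25
        return self.OBS[self.state], reward, False, {}
```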

4.1.2 Analysis of an agent’s performance

When the agent receives the full observation, the upper bound of the agent’s performance is

$$ \sup(\overline{\mathcal{R}}_{fo}) = {\sum}_{k=0}^{\infty}{0.9^{k} \times 0.25 \times 1} = 2.5. $$

We calculate the upper bound of the agent performance after the agent converges to s = 3 for computational simplicity. The upper bound can be reached when the agent adopts the policy \(\pi(a\mid o)\) as

$$ \pi(a\mid o) : \left\{\begin{array}{ll} p(a=0\mid o=(0,0,0)) = 1\\ p(a=0\mid o=(0,1,1)) = 1\\ p(a=0\mid o=(1,0,1)) = 1\\ p(a=1\mid o=(1,1,0)) = 1 \end{array}\right. $$

Assume that the agent’s policy model extracts the features of the observation o to an intermediate causal factor \(\tilde {s}\) that estimates the latent state of the environment before making a decision, as

$$ o \mapsto \tilde{s} \rightarrow a $$

Let the mapping distribution be

$$ \phi_{1}(o) : \left\{\begin{array}{ll} p(\tilde{s}_{1}=0, \tilde{s}_{2}=0\mid o=(0,0,0)) = 1\\ p(\tilde{s}_{1}=0, \tilde{s}_{2}=1\mid o=(0,1,1)) = 1\\ p(\tilde{s}_{1}=1, \tilde{s}_{2}=0\mid o=(1,0,1)) = 1\\ p(\tilde{s}_{1}=1, \tilde{s}_{2}=1\mid o=(1,1,0)) = 1 \end{array}\right. $$

We know that the best policy can be determined by causal factors \(\tilde {s}_{1},\tilde {s}_{2}\) as

$$ \pi(a\mid \tilde{s}_{1},\tilde{s}_{2}) : \left\{\begin{array}{ll} p(a=0\mid \tilde{s}_{1}=0, \tilde{s}_{2}=0) = 1\\ p(a=0\mid \tilde{s}_{1}=0, \tilde{s}_{2}=1) = 1\\ p(a=0\mid \tilde{s}_{1}=1, \tilde{s}_{2}=0) = 1\\ p(a=1\mid \tilde{s}_{1}=1, \tilde{s}_{2}=1) = 1 \end{array}\right. $$

Therefore, \(\{\tilde {s}_{1},\tilde {s}_{2}\}\) is a complete causal factor set according to Definition 2. However, for the set \(\{\tilde {s}_{1}\}\) alone, the best action cannot be determined. Assume that the environment states occur with equal probability. The best policy is then

$$ \pi(a\mid \tilde{s}_{1}) : \left\{\begin{array}{ll} p(a=0\mid \tilde{s}_{1}=0) = 1\\ p(a=0\mid \tilde{s}_{1}=1) = 0.5\\ p(a=1\mid \tilde{s}_{1}=1) = 0.5 \end{array}\right. $$

According to (23), we can calculate the saliency values of the intermediate causal factors \(\tilde {s}_{1}\) and \(\tilde {s}_{2}\) as

$$ \begin{aligned} S_{a}[\tilde{s}_{1}=0] &= \| f_{a}(\tilde{s}_{1}=0) - f_{a}(\tilde{s}^{\prime}_{1}=1)\| \\ &=p(a=0\mid \tilde{s}^{\prime}_{1}=1) \times 0 + p(a=1\mid \tilde{s}^{\prime}_{1}=1) \times 1\\ &= 0.5\times 0 + 0.5\times 1 \\ &= 0.5, \end{aligned} $$
$$ \begin{aligned} S_{a}[\tilde{s}_{1}=1] &= \|f_{a}(\tilde{s}_{1}=1) - f_{a}(\tilde{s}^{\prime}_{1}=0)\| \\ &=p(a=0\mid \tilde{s}_{1}=1) \times 0 + p(a=1\mid \tilde{s}_{1}=1) \times 1\\ &= 0.5\times 0 + 0.5\times 1 \\ &= 0.5. \end{aligned} $$

Similarly,

$$ S_{a}[\tilde{s}_{2}=0] = 0.5, $$
$$ S_{a}[\tilde{s}_{2}=1] = 0.5. $$

According to (27), the causal factor set completeness of \(\{\tilde {s}_{1}\}\) is

$$ N_{\tilde{s}_{1}} = 0.5. $$

In the same way, we can calculate the causal factor set completeness of \(\{\tilde {s}_{2}\}\),

$$ N_{\tilde{s}_{2}} = 0.5. $$

According to Definition 5, \(\tilde {s}_{1}\) and \(\tilde {s}_{2}\) are complementary causal factor sets. The action can be determined by considering the two factors \(\tilde {s}_{1}\) and \(\tilde {s}_{2}\) together, since the causal factor set completeness of the combination is \(N_{\tilde {s}_{1},\tilde {s}_{2}} = 1\).

Notably, we can further reduce the dimension of \(\tilde {s}\) with respect to another mapping distribution as

$$ \phi_{2}(o) : \left\{\begin{array}{ll} p(\tilde{s}=0\mid o=(0,0,0)) = 1\\ p(\tilde{s}=0\mid o=(0,1,1)) = 1\\ p(\tilde{s}=0\mid o=(1,0,1)) = 1\\ p(\tilde{s}=1\mid o=(1,1,0)) = 1 \end{array}\right. $$

We know that the best policy in terms of \(\tilde {s}\) is

$$ \pi(a\mid \tilde{s}) : \left\{\begin{array}{ll} p(a=0\mid \tilde{s}=0) = 1\\ p(a=1\mid \tilde{s}=1) = 1 \end{array}\right. $$

For \(\tilde {s} = \phi _{2}(o)\), the best estimation of the action distribution has not changed, i.e., \(\hat {p}(a\mid \tilde {s}) = \hat {p}(a\mid o)\). Therefore, the causal factor \(\tilde {s}\) preserves complete observation information and will not degrade the model performance. The optimal policy offset is

$$\mathcal{D}_{\tilde{s},o} = 0.$$

In this example, it is also easy to calculate the lower bound of the agent performance, reached when the agent always chooses a = 1 and remains in a state s≠ 3, as

$$ \inf(\overline{\mathcal{R}}) = -10. $$

To demonstrate the upper bound of an agent’s performance in this environment, we use an agent based on Q-learning and an agent based on DQN to fit the environment. The experimental results are illustrated in Fig. 3.

Fig. 3

The training process illustration of the return of two agents based on Q-learning and DQN in terms of \(\tilde {s}\)

4.1.3 Analysis of the agent’s performance reduction

Consider the POMDP case in which the agent can only observe the first dimension of the observation and each state occurs with equal probability. The conditional distribution \(p(s\mid o)\) is

$$ p(s\mid o): \left\{\begin{array}{ll} p(s=0\mid o=(0,)) = 0.5 \\ p(s=1\mid o=(0,)) = 0.5 \\ p(s=2\mid o=(1,)) = 0.5 \\ p(s=3\mid o=(1,)) = 0.5 \end{array}\right. $$

The best policy of the agent would be to adopt a = 0 all the time: because it cannot verify whether s = 3 or s = 2 when it observes o = (1,), it might become stuck in s = 2 if it adopts a = 1. The upper bound performance of the agent changes to

$$ \sup(\overline{\mathcal{R}}_{po}) = 0. $$

Due to the information loss, there is a reduction in the agent performance upper bound, which is

$$ {\Delta} \overline{\mathcal{R}} = \sup(\overline{\mathcal{R}}_{fo}) - \sup(\overline{\mathcal{R}}_{po}) = 2.5. $$

To demonstrate the upper bound of an agent’s performance in the POMDP environment, we again use the same two agents based on Q-learning and DQN to fit the partially observable case. The experimental results are illustrated in Fig. 4.

Fig. 4

Training process illustration of the return of two agents based on Q-learning and the DQN that can only observe the first dimension of o

4.2 Experiments in a complex environment

4.2.1 Environment setup

To demonstrate the interpretable methods, we perform experiments in Atari game environments through the OpenAI Gym API. The observation space of the Atari environments is a video screen that displays frames of color images of 210 × 160 × 3 pixels. The action space comprises 18 discrete numbers corresponding to the buttons of the joystick controller, and different games use different minimal action sets. The reward is generally defined by the game score.

We choose the game of Breakout-v4 as our experimental environment and design three observation spaces for the agent to examine the importance of the observable information for an agent’s performance upper bound.

  • Coordinates of the ball and pad.

    $$ \mathcal{O}_{1}: (x_{1}, y_{1}, x_{2}, y_{2}), $$

    where x1 and y1 are the horizontal and vertical coordinates of the ball, respectively, and x2 and y2 are the horizontal and vertical coordinates of the pad, respectively.

  • Horizontal and vertical relative positions between the ball and the pad.

    $$ \mathcal{O}_{2}: (d_{1}, d_{2}, 0, 0), $$

    where \(d_{1} = x_{2} - x_{1}\) and \(d_{2} = y_{2} - y_{1}\). To keep the same dimensionality as \(\mathcal {O}_{1}\), we use 0 to represent a constant zero dimension. Different states (x1,y1,x2,y2) and \((x^{\prime }_{1}, y^{\prime }_{1}, x^{\prime }_{2}, y^{\prime }_{2})\) will have the same relative positions (d1,d2) if \(x_{2} - x^{\prime }_{2} = x_{1} - x^{\prime }_{1}\) and \(y_{2} - y^{\prime }_{2} = y_{1} - y^{\prime }_{1}\). The agent loses the information about the distance from the ball to the sidewall, which makes it difficult for the agent to predict the ball trajectories.

  • Horizontal relative position between the ball and the pad.

    $$ \mathcal{O}_{3}: (d_{1}, 0, 0, 0). $$

    Compared with the second case, this case will further lose the vertical coordinate information of the ball. In this case, the agent cannot know whether the ball is near or far from the pad. Therefore, the agent’s best policy may be to try to reduce the horizontal distance between the ball and the pad at all times.
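A minimal sketch of the two reduced observation mappings and the rule-based baseline described above; the Breakout action ids in rule_agent are assumptions that would need to match the environment’s minimal action set.

```python
def o1_to_o2(o1):
    """O_1 = (x1, y1, x2, y2) -> O_2 = (d1, d2, 0, 0) with d1 = x2 - x1, d2 = y2 - y1."""
    x1, y1, x2, y2 = o1
    return (x2 - x1, y2 - y1, 0, 0)

def o1_to_o3(o1):
    """O_1 -> O_3 = (d1, 0, 0, 0): only the horizontal ball-pad offset is kept."""
    x1, _, x2, _ = o1
    return (x2 - x1, 0, 0, 0)

def rule_agent(o3):
    """Rule-based baseline on O_3: move the pad toward the ball."""
    d1 = o3[0]
    if d1 > 0:
        return 3     # assumed 'move left' id in Breakout's minimal action set
    if d1 < 0:
        return 2     # assumed 'move right' id
    return 0         # NOOP
```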

First, we need to estimate an agent’s performance upper bound given a certain observation space. To illustrate the effective information provided by the three observation spaces \(\mathcal {O}_{1}\), \(\mathcal {O}_{2}\), and \(\mathcal {O}_{3}\), we use the DQN to probe the agent performance upper bound with respect to each observation space, as illustrated in Fig. 5. Note that any DRL algorithm can be utilized to probe the agent performance upper bound as long as it achieves good performance, because our aim is to discover the best agent in a given observation space rather than to study how to train the agent. In this environment, the DQN has achieved state-of-the-art performance, and we use it for simplicity. We also design a rule-based agent that achieves a baseline performance for comparison. The rule-based agent only utilizes the information of \(\mathcal {O}_{3}\) and follows simple control rules: for example, the pad moves left if the ball is on the left side of the pad, and the pad moves right when the ball is on the right of the pad. We know that the upper bound performance of an agent will not be less than that of the rule-based agent when the agent can only receive the observation information in terms of the horizontal relative distance between the ball and the pad, i.e., observation space \(\mathcal {O}_{3}\).

Fig. 5

Illustration of the training processes of the agents with different observation information in the Atari Breakout-v4 games. DQN-\(\mathcal {O}_{1}\) denotes the agent based on the DQN with observation space \(\mathcal {O}_{1}\), DQN-\(\mathcal {O}_{2}\) denotes the agent based on the DQN with observation space \(\mathcal {O}_{2}\), DQN-\(\mathcal {O}_{3}\) denotes the agent based on the DQN with observation space \(\mathcal {O}_{3}\), and Rule-\(\mathcal {O}_{3}\) denotes the agent using defined control rules with the information of observation space \(\mathcal {O}_{3}\). Each performance curve is illustrated by merging 10 training results

According to the experimental results, the agent performance upper bounds of \(\mathcal {O}_{1}\), \(\mathcal {O}_{2}\), and \(\mathcal {O}_{3}\) will not be less than 400, 325, and 60, respectively, represented as

$$ \sup(\overline{\mathcal{R}}_{\mathcal{O}_{1}}) \geqslant 400, $$
$$ \sup(\overline{\mathcal{R}}_{\mathcal{O}_{2}}) \geqslant 325, $$
$$ \sup(\overline{\mathcal{R}}_{\mathcal{O}_{3}}) \geqslant 60. $$

The agent, which only receives the observation space of \(\mathcal {O}_{1}\), reaches the performance declared in [1]. We can reasonably speculate that the observation space \(\mathcal {O}_{1}\) contains complete observation information of the raw image observation space for this learning task to reach the performance upper bound. To a certain extent, assuming that the DQN agents have achieved the optimal policies, we can estimate the agent performance upper bounds by the experimental results.

4.2.2 Conceptual embedding in decision-making

In many cases, the prior knowledge of human beings is incomplete, which is why we need DRL-based agents to explore more useful information and policy rules that can, in turn, help us improve our knowledge. Therefore, we set FNs and CNs in the conceptual embedding framework to learn latent representations that contain complementary information with respect to the prior knowledge of the CNs. To observe the complementary information that can be learned by the DRL model, we set two CNs corresponding to the observation space \(\mathcal {O}_{2}\) and one FN in the DRL model, as illustrated in Fig. 6. One CN is related to the relative horizontal position d1 between the ball and the pad, and the other CN is related to the relative vertical position d2 between the ball and the pad. To embed the concept into the DRL model, we use a supervised learning method to align the CNs with the relative horizontal and vertical positions (d1,d2). A designed function is employed to extract the positions of the ball and the pad to calculate the true (d1,d2). In the training process, we add an extra mean square error (MSE) loss between the activation \((\tilde {d}_{1}, \tilde {d}_{2})\) of the CNs and the true (d1,d2).

Fig. 6

Illustration of the conceptual embedding architecture for playing Breakout. The architecture can be regarded as two functional parts: the recognition part and decision part. The recognition part, which contains three dense layers, maps the high-dimensional observation data to a low-dimensional representation space. The size of the original color image is 210 × 160 × 3 pixels. We reduce the dimension to a gray image of 1 × 80 × 80 pixels that will not lose effective information for playing the game while it reduces the computational cost of the experiments. The decision part, which also contains three dense layers, estimates the best action according to the extracted information from the compacted representations. The information bottleneck has two conceptual neurons (CNs) and a free neuron (FN). The two CNs are trained to be aligned with the horizontal and vertical relative positions between the ball and the pad (d1,d2). The position function is artificially designed to extract the relative position from the image, and the relative position is employed to guide the training of the two CNs
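A minimal sketch of the alignment step described above, assuming a model shaped like the hypothetical ConceptualPolicy from Section 3.3 whose forward pass returns both Q-values and CN activations, and a hand-designed extract_relative_position function; the DQN target computation is omitted and the concept-loss weight is an assumption.

```python
import torch.nn.functional as F

def training_loss(model, obs_batch, q_targets, extract_relative_position, concept_weight=1.0):
    """Combine a DQN-style regression loss with the MSE concept-alignment loss on the CNs."""
    q_values, cn_activations = model(obs_batch)        # CNs should track (d1, d2)
    true_d = extract_relative_position(obs_batch)      # hand-designed position function
    td_loss = F.smooth_l1_loss(q_values, q_targets)    # placeholder for the usual DQN loss
    concept_loss = F.mse_loss(cn_activations, true_d)  # align (d1~, d2~) with true (d1, d2)
    return td_loss + concept_weight * concept_loss
```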

From the previous experimental results illustrated in Fig. 5, we know that the agent performance upper bound is a score of approximately 325 when the observable information is provided by only the two CNs of the observation space \(\mathcal {O}_{2}\). The FN is used to extract extra information, which can increase the agent performance upper bound. The training result of the conceptual embedding architecture is illustrated in Fig. 7. We observe that the agent performance upper bound of the conceptual embedding architecture reaches an increased score of 350 compared with the DQN agent with observation space \(\mathcal {O}_{2}\). The experimental results validate that the FN can provide extra information that raises the agent performance upper bound.

Fig. 7

Illustration of the training process of the conceptual embedding architecture compared with the DQN agent with observation space \(\mathcal {O}_{2}\). DQN-\(\mathcal {O}_{2}\) denotes the DQN agent with observation space \(\mathcal {O}_{2}\). DQN-CF denotes the DQN agent that adopts the conceptual embedding architecture. The DQN-CF performance curve is illustrated by merging 5 training results

However, we observe that the FN has not extracted the complete complementary information with respect to observation space \(\mathcal {O}_{1}\), in which an agent can reach a score of 400. In the experiments, we also compare the convergence of the DRL model with and without conceptual embedding. Unfortunately, we discovered that the DRL agent using the same architecture without conceptual embedding usually could not learn a good representation space or converge to an effective policy. The bottleneck layer of the freely trained DRL agent cannot extract enough effective information to promote the agent performance. Therefore, extracting effective information during the representation space transformation of a freely trained DRL model remains a significant problem, and it could be a promising research direction for improving DRL algorithms.

To judge the importance of the causal factors, we can use the perturbation technique described in Section 3.4 to estimate the saliency values of the two CNs and FN. Specifically, we add Gaussian noise \(\mathcal {N}(0, 0.1)\) to the activation values of neurons and calculate the mean difference value between the perturbed actions and the original actions. We use the well-trained DQN-CF model to run ten groups of tests, and each group runs 1000 steps. The experimental results of the calculated saliency values of the free causal factor and conceptual causal factor are illustrated in Table 2. Generally, a large saliency value of a conceptual causal factor corresponds to a small saliency value of a free causal factor. The saliency values of conceptual causal factor c and free causal factor f, denoted by Sa[c] and Sa[f] respectively, can be estimated by the normalized mean of the experimental results, as

$$ S_{a}[c] \approx 0.7099, $$
$$ S_{a}[f] \approx 0.0013. $$

From the saliency values, we can observe that the perturbed free causal factor has a small influence on the agent’s decision. The conceptual causal factor almost completely dominates the agent’s decision, which verifies that the two CNs have extracted most of the observation information used for the agent’s decision.

Table 2 Saliency values of free causal factor and conceptual causal factor

Because the saliency values weight the influence of the causal factors for an agent’s decision, the causal factor set completeness of conceptual causal factor c and free causal factor f, denoted by Nc and Nf respectively, can be estimated by the normalized saliency values as

$$ N_{c} \approx 0.9982, $$
$$ N_{f} \approx 0.0018. $$

5 Discussion

In the training process of the experiments, we discovered that the DRL methods were very sensitive to the manually predefined hyperparameters, such as the learning rate, replay memory size, training batch size, and model update frequency. The training results were unstable, and the DRL-based agent often could not explore and converge to an effective strategy, especially in an environment with a high-dimensional observation space. Thus, we had to spend a substantial amount of time debugging and searching over many hyperparameters. We suggest that it is indispensable to introduce prior knowledge to increase the interpretability of DRL agents, which can substantially reduce the debugging time and expedite the training process of the agents. Just as human society develops common sense by passing systematic knowledge and rules to later generations, prior knowledge embedding methods will be a potential approach to improving the interpretability of DRL agents. On the other hand, we determined that freely trained DRL methods often cannot learn an effective representation space while preserving the maximum observation information. The learned representation space loses certain key information, which determines the upper bound performance of an agent. However, in complex unknown environments, we have to estimate the agent performance upper bound. How can the estimated upper bound be guaranteed to be close to the real upper bound within a certain range? This is still an open problem. Preserving the maximum observation information of the representation space in the DNN-based model could be a promising direction for future work.

The proposed method is a general framework that can be applied in different DRL algorithms to improve model interpretability. In particular, the conceptual embedding techniques can provide interpretable causal factors for applications that require good model interpretability and reliability, such as autonomous driving and healthcare. For example, an interpretable autonomous driving agent should know the conceptual reason that a barrier in front of the car contributes to the braking action. We know that prior concepts are an indispensable component of the conceptual embedding model. On the other hand, an agent requires fewer training samples if the concepts provide sufficient guidance and contain enough information; in many cases, introducing prior concepts to DRL agents costs less than having agents collect training samples by free exploration. Nevertheless, it will be interesting to investigate how FNs learn unknown complementary concepts with respect to the prior concepts contained in the CNs.

6 Conclusion

Interpretability is a key property of DRL agents. For example, why does an agent adopt a certain action? What key information in the observation affects an agent’s performance? We discovered that the difficulties in interpreting DRL agents are mainly attributed to the DNN-based model. We analyzed the decision process of DRL agents based on information theory and identified a relationship between an agent’s observable information and its performance upper bound. To make DRL agents learn a more interpretable representation space, we proposed a hierarchical conceptual embedding method that introduces prior knowledge to constrain the representation spaces of the DNN-based model. As demonstrated in the experiments, the method can explicitly indicate the action-driving factors, which renders the decision process of the DRL agent tractable and interpretable. In addition, the method has the benefit that the learning process is more efficient and the model converges faster than freely trained DRL models.