
1 Introduction

As important entities in smart grids, microgrids (MGs) are small-scale power supply networks that consist of renewable energy generators, such as wind turbines and solar panels, local electrical consumers, and energy storage devices [1]. Each MG is aware of the local energy supply and demand profiles of the other MGs and the nearby power plant, such as the energy selling prices, via wireless networks [2]. Therefore, MGs with extra energy can sell it to MGs with insufficient energy, which reduces their dependence on the energy generated by fossil-fuel power plants and avoids long-distance transmission losses.

Game theory is an important tool to study energy trading in smart grids [3,4,5,6,7,8]. For example, the energy demand of consumers and the response of utility companies are formulated as a Stackelberg game in [4], yielding a reserve power management scheme that decides the energy trading price. The energy trading of a power facility controller that buys energy from the power plant and multiple residential users was studied in [6], yielding a charging-discharging strategy that minimizes the total energy purchase cost. The energy exchange game for MGs formulated in [7] analyzes the subjective decisions of end-users in the energy exchange with prospect theory. The energy exchange game developed in [8] addresses energy cheating with the indirect reciprocity principle.

However, to the best of our knowledge, the game-theoretic study of energy trading among multiple MGs with heterogeneous and autonomous operators and renewable energy supply is still an open issue. In this paper, we formulate the energy exchange interactions among interconnected MGs and the power plant as an energy trading game, in which each MG chooses the amount of energy to sell to or purchase from the connected MGs and the power plant based on its battery level, the energy generation model, and the trading history. The MGs negotiate with each other on the amount of trading energy according to their time-varying renewable energy generation and power demand. An energy generation model such as [13] is incorporated in the energy trading game to estimate the renewable energy generation. The Nash equilibrium (NE) of this game is derived, disclosing the conditions under which the MGs are motivated to provide their extra renewable energy to other MGs and purchase less energy from the power plants.

Reinforcement learning techniques, such as Q-learning, can be used in smart grids to manage energy storage and generation. For example, the temporal-difference-learning based storage control scheme proposed in [9] for residential users minimizes the electricity bill without knowing the power conversion efficiencies of the DC/AC converters. The Q-learning based heterogeneous storage control system with multiple battery types proposed in [10] improves the system efficiency. In the two-layer Markov model based on reinforcement learning investigated in [11], generators choose whether to participate in the next day's generation process in the power grid to improve both the day-ahead and real-time reliability. However, these works focus on energy storage and generation rather than energy trading among MGs.

In this paper, a Q-learning based energy trading strategy is proposed for each MG to derive the optimal policy via trial and error, without being aware of the energy demand model and the storage levels of the other MGs in the dynamic game. To accelerate learning, we exploit the renewable energy generation model in the learning process and design a hotbooting technique that applies the trading experiences gathered in similar smart grid scenarios to initialize the quality values of the Q-learning algorithm at the beginning of the game. Simulation results show that the hotbooting Q-learning based energy trading scheme further promotes energy trading among the connected MGs in a smart grid, reduces the reliance on the energy from the power plants, and significantly improves the utility of the MGs.

The rest of this paper is organized as follows: The energy trading game is formulated in Sect. 2, and the NE of the game is provided in Sect. 3. A hotbooting Q-learning based energy trading strategy is proposed for the dynamic game in Sect. 4. Simulation results are provided in Sect. 5, and conclusions are drawn in Sect. 6.

2 Energy Trading Game

We consider an energy trading game consisting of N MGs that are connected with each other and with a power plant in the main grid via a substation. Each MG is equipped with renewable power generators, active loads, electricity storage devices, and power transmission lines connecting it with the other MGs and the power plant. An MG obtains its energy supply from the other MGs, the power plant, and local renewable energy generators based on wind, photovoltaic, biomass, and tidal energy.

Renewable energy generation such as wind power is location-dependent, intermittent, and time-varying. The amount of energy generated by the renewable power generators of MG i at time k, denoted by \(g_{i}^{(k)}\), can be estimated from the power generation history with a modeling method such as that in [13], yielding an estimated amount of generated power denoted by \(\hat{g}_{i}^{(k)}\). For simplicity, the estimation error of \(g_{i}^{(k)}\) is assumed to follow a uniform distribution, given by

$$\begin{aligned} g_{i}^{(k)}-\hat{g}_{i}^{(k)}\sim G\cdot \text {U}(-1,1), \end{aligned}$$
(1)

where G is the maximum estimation error.
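
As a quick illustration of the error model in (1), the following Python sketch draws an estimate whose deviation from the true generation is uniform on \([-G, G]\); the function name and interface are our own illustrative assumptions, not from the paper.

```python
import numpy as np

def sample_generation_estimate(g_true, G_max, rng=np.random.default_rng()):
    """Return an estimate g_hat of the true generation g_true such that
    g_true - g_hat follows G_max * U(-1, 1), as assumed in (1)."""
    error = G_max * rng.uniform(-1.0, 1.0)  # realization of g - g_hat
    return g_true - error
```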

In a smart grid, the energy trading interactions among the MGs can be formulated as an energy trading game with N players. The amount of energy that MG i intends to sell to (or buy from) MG j before the bargaining is denoted by \(x_{ij}^{(k)}\), which is chosen by MG i based on the observed state of the smart grid, such as its battery level, the energy trading prices, and its current energy production and demand. The trading strategy of MG i at time k is denoted by \(\varvec{x}_{i}^{(k)}=[x_{ij}^{(k)}]_{1\le j\le N}\in \varvec{X}\), where \(\varvec{X}\) is the feasible action set of the MGs and \(x_{ii}^{(k)}\) is the amount of energy that MG i intends to trade with the power plant. If \(x_{ij}^{(k)}>0\), MG i intends to sell its extra energy to MG j or the power plant; if \(x_{ij}^{(k)}<0\), MG i aims to buy energy.

Note that two MGs may intend to sell energy to each other at the same time, i.e., \(x_{ij}^{(k)}x_{ji}^{(k)}>0\), which has to be resolved by the energy trading bargaining. The resulting actual trading strategy of MG i at time k is denoted by \( \varvec{y}_{i}^{(k)}=[y_{ij}^{(k)}]_{1\le j\le N}\), where \(y_{ii}^{(k)}\) and \(y_{ij}^{(k)}\) denote the amounts of energy sold (if positive) by MG i to the power plant and to MG j, respectively, or the amounts of energy purchased from them (if negative), with \(|y_{ij}^{(k)}|\le C\), where C is the maximum amount of energy exchanged between two MGs. The time index k is omitted if no confusion arises. The actual amount of energy traded between MG i and MG j after the bargaining is determined by their trading intentions and given by

$$\begin{aligned} y_{ij}={\left\{ \begin{array}{ll} -\min (-x_{ij},x_{ji}),&{} \text{ if } x_{ij}<0,\,x_{ji}>0 \\ \min (x_{ij},-x_{ji}),&{} \text{ if } x_{ij}>0,\,x_{ji}<0 \\ 0,&{} \text{ o.w. }\end{array}\right. } \end{aligned}$$
(2)

In this way, we can ensure that \(y_{ij}+y_{ji}=0\), \(\forall \, i\ne j\). The amount of energy that MG i trades with the power plant is given by

$$\begin{aligned} y_{ii}=\sum _{j=1, j\ne i}^{N}x_{ij}-\sum _{j=1, j\ne i}^{N}y_{ij}. \end{aligned}$$
(3)
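
The bargaining rule (2) and the residual trade with the power plant (3) can be sketched in Python as follows; the matrix layout (row i holds MG i's intended trades, with the diagonal entry reserved for the power plant) and the function names are our own illustrative assumptions.

```python
import numpy as np

def bargain(x: np.ndarray) -> np.ndarray:
    """Map the intended trades x[i, j] of N MGs to the actual MG-to-MG trades
    y[i, j] according to (2): a trade happens only when one side intends to sell
    (x_ij > 0) and the other intends to buy (x_ji < 0)."""
    N = x.shape[0]
    y = np.zeros_like(x)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            if x[i, j] < 0 and x[j, i] > 0:    # MG i buys, MG j sells
                y[i, j] = -min(-x[i, j], x[j, i])
            elif x[i, j] > 0 and x[j, i] < 0:  # MG i sells, MG j buys
                y[i, j] = min(x[i, j], -x[j, i])
    return y                                   # satisfies y_ij + y_ji = 0

def plant_trade(x: np.ndarray, y: np.ndarray, i: int) -> float:
    """Amount of energy MG i trades with the power plant after bargaining, as in (3)."""
    others = [j for j in range(x.shape[0]) if j != i]
    return float(x[i, others].sum() - y[i, others].sum())
```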

Energy storage devices, such as batteries, charge when the load in the MG is low and discharge when the load is high. The battery level of MG i, denoted by \(b_{i}^{(k)}\), cannot exceed the storage capacity B, with \(0< b^{(k)}_{i}\le B\). The estimated amount of the local energy demand is denoted by \(d^{(k)}_{i}\), with \(0\le d^{(k)}_{i} \le D_{i}\), where \(D_{i}\) is the maximum amount of local energy required by MG i. The battery level of MG i depends on the amount of trading energy, the local energy generation, and the energy demand at that time. For a smart grid with N MGs, we have

$$\begin{aligned} b^{(k)}_{i}=b^{(k-1)}_{i}+g^{(k)}_{i} - d^{(k)}_{i}+\sum _{j=1}^{N}y_{ij}^{(k)}. \end{aligned}$$
(4)

The energy gain of MG i, denoted by \(G_i(b)\), is defined as the benefit that MG i obtains from the battery level b; it is nondecreasing in b with \(G_i(0)=0\). As the logarithmic function is widely used in economics to model the preference ordering of users and for decision making [4], we assume that

$$\begin{aligned} G_i(b)=\beta _i\ln (1+b), \end{aligned}$$
(5)

where the positive coefficient \(\beta _i\) represents the ability of MG i to satisfy the energy demand of its users.

To encourage energy exchange among the MGs, the local market offers a higher selling price, denoted by \(\rho ^{-(k)}\), and a lower buying price, denoted by \(\rho ^{+(k)}\), for the trades between MGs, compared with the corresponding prices offered by the power plant, denoted by \(\rho _p^{-(k)}\) and \(\rho _p^{+(k)}\), respectively, i.e., \(\rho ^{-(k)}>\rho _p^{-(k)}\) and \(\rho ^{+(k)}<\rho _p^{+(k)}\).

The utility of MG i at time k, denoted by \(u_{i}^{(k)}\), depends on the energy gain and the trading profit, given by

$$\begin{aligned} \begin{aligned} u_{i}^{(k)}(\varvec{y})=&\beta \ln \left( 1+b_{i}^{(k-1)}+g_{i}^{(k)}-d_{i}^{(k)}+\sum _{j=1}^{N}y_{ij}\right) -\sum _{j\ne i}^{N}y_{ij}\left( \text {I}(y_{ij}\le 0)\rho ^{-(k)}\right. \\&\left. + \ \text {I}(y_{ij}> 0)\rho ^{+(k)}\right) -y_{ii}\left( \text {I}(y_{ii}\le 0)\rho ^{-(k)}_{p}+\text {I}(y_{ii}> 0)\rho ^{+(k)}_{p}\right) , \end{aligned} \end{aligned}$$
(6)

where \(\text {I}(\cdot )\) is an indicator function that equals 1 if its argument is true and 0 otherwise.
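
Putting (4)-(6) together, the battery update and the utility of MG i can be computed as in the sketch below; the signs and price superscripts follow the formulation above, while the function signatures are our own illustrative assumptions.

```python
import numpy as np

def battery_update(b_prev, g, d, y_i):
    """Battery level of MG i from (4); the result is assumed to satisfy 0 < b <= B."""
    return b_prev + g - d + np.sum(y_i)

def utility(i, b_prev, g, d, y_i, beta, rho_minus, rho_plus, rho_p_minus, rho_p_plus):
    """Utility of MG i from (6): energy gain of the new battery level minus the
    payments for the MG-to-MG trades and for the trade with the power plant.
    y_i is MG i's trade vector, with y_i[i] the trade with the power plant."""
    gain = beta * np.log(1.0 + b_prev + g - d + np.sum(y_i))
    cost_mg = sum(y_ij * (rho_minus if y_ij <= 0 else rho_plus)
                  for j, y_ij in enumerate(y_i) if j != i)
    cost_plant = y_i[i] * (rho_p_minus if y_i[i] <= 0 else rho_p_plus)
    return gain - cost_mg - cost_plant
```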

3 NE of the Energy Trading Game

We first consider the NE of the energy trading game with \(N=2\) MGs, denoted by \(\varvec{x_{i}^{*}}=[ x_{ij}^{*}]_{1\le j \le 2}\). At the NE, each MG chooses the energy trading strategy that maximizes its own utility, given that the other MG applies its NE strategy. By definition, we have

$$\begin{aligned} u_{1}(\varvec{x_{1}^{*}}, \varvec{x_{2}^{*}})\ge u_{1}(\varvec{x_{1}}, \varvec{x_{2}^{*}}), \quad \forall \varvec{x_{1}} \in \varvec{X} \end{aligned}$$
(7)
$$\begin{aligned} u_{2}(\varvec{x_{1}^*}, \varvec{x_{2}})\le u_{2}(\varvec{x_{1}^*}, \varvec{x_{2}^{*}}), \quad \forall \varvec{x_{2}} \in \varvec{X}. \end{aligned}$$
(8)

Theorem 1

The energy trading game with \(N=2\) microgrids and a power plant has an NE (\(\varvec{x_{1}^{*}}\), \(\varvec{x_{2}^{*}}\)) given by

$$\begin{aligned} \varvec{x_{1}^{*}}=&\left[ 0,\frac{\beta }{\rho }-1-b_{1}^{(k-1)}-g_{1}^{(k)}+d_{1}^{(k)}\right] \end{aligned}$$
(9)
$$\begin{aligned} \varvec{x_{2}^{*}}=&\left[ \frac{\beta }{\rho -1}-1-b_{2}^{(k-1)}-g_{2}^{(k)}+d_{2}^{(k)},0\right] , \end{aligned}$$
(10)

if

[Condition (11) is given as a figure in the original.]

Proof

If (11) holds, then by (2) and (3) we have \(x_{11}=x_{22}=0\) and \(y_{12}=\min (x_{12},-x_{21})=x_{12}\), so (6) simplifies to

$$\begin{aligned}&u_{1}(\varvec{x_{1}},\varvec{x_{2}^{*}})= \beta \ln \left( 1+b_{1}^{(k-1)}+g_{1}^{(k)}-d_{1}^{(k)} + x_{12}\right) -x_{12}\rho ,\end{aligned}$$
(12)
$$\begin{aligned}&u_{2}(\varvec{x_{1}^{*}},\varvec{x_{2}})= \beta \ln \left( 1+b_{2}^{(k-1)}+g_{2}^{(k)}-d_{2}^{(k)}+ x_{21}\right) -x_{21}(\rho -1)+x_{12}^{*}. \end{aligned}$$
(13)

Thus, we have

$$\begin{aligned} \frac{du_{1}(\varvec{x_{1}}, \varvec{x_2^*})}{dx_{12}}=\frac{\beta }{1+b_{1}^{(k-1)}+g_{1}^{(k)}-d_{1}^{(k)} +x_{12}}-\rho , \end{aligned}$$
(14)

and

$$\begin{aligned} \frac{d^2u_{1}(\varvec{x_{1}}, \varvec{x_2^*})}{dx_{12}^{2}}=-\frac{\beta }{\left( 1+b_{1}^{(k-1)}+ g_{1}^{(k)}-d_{1}^{(k)}+x_{12}\right) ^2}< 0, \end{aligned}$$
(15)

indicating that \(u_{1}(\varvec{x_{1}}, \varvec{x_{2}^*})\) is concave in \(\varvec{x_{1}}\). The solution of \(du_{1}(\varvec{x_{1}}, \varvec{x_{2}^*})/dx_{12}=0\) is therefore the second element of (9), so \(u_{1}(\varvec{x_{1}}, \varvec{x_{2}^*})\) is maximized by \(\varvec{x_{1}^{*}}\) in (9), indicating that (7) holds. Similarly, we can prove that (8) holds.
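
The first-order condition in the proof can also be reproduced symbolically; the short sympy sketch below (our own check, not part of the original proof) solves (14) and recovers the second element of (9).

```python
import sympy as sp

beta, rho = sp.symbols('beta rho', positive=True)
b1, g1, d1, x12 = sp.symbols('b1 g1 d1 x12', real=True)

# Utility of MG 1 from (12)
u1 = beta * sp.log(1 + b1 + g1 - d1 + x12) - rho * x12

# Solve the first-order condition (14); since the second derivative (15) is negative,
# the stationary point is the maximizer.
x12_star = sp.solve(sp.Eq(sp.diff(u1, x12), 0), x12)[0]
print(sp.expand(x12_star))  # equals beta/rho - 1 - b1 - g1 + d1, the second entry of (9)
```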

Corollary 1

At the NE of the energy trading game with \(N=2\) MGs, if (11) holds, MG 1 buys \(y_{12}^{*}\) amount of energy from MG 2, and MG 2 sells \(-y_{22}^{*}\) amount of energy to the power plant, with

$$\begin{aligned}&y_{12}^{*}=\frac{\beta }{\rho }-1-b_{1}^{(k-1)}-g_{1}^{(k)}+d_{1}^{(k)}\end{aligned}$$
(16)
$$\begin{aligned}&-y_{22}^{*}=\beta \frac{2\rho -1}{\rho (\rho -1)}+2+\sum _{i=1}^{N}\left( b_{i}^{(k-1)}+g_{i}^{(k)}-d_{i}^{(k)}\right) , \end{aligned}$$
(17)

and the utility of MG 1 and that of MG 2 are given respectively by

$$\begin{aligned} u_{1}=&\beta \left( \ln \frac{\beta }{\rho }-1\right) + \rho \left( 1+b_{1}^{(k-1)}+g_{1}^{(k)}-d_{1}^{(k)}\right) \\ u_{2}=&\beta \left( \ln \frac{1}{\rho -1}-1+\frac{1}{\rho }\right) + \rho \left( 1+b_{2}^{(k-1)}+g_{2}^{(k)}-d_{2}^{(k)}\right) \\&-2-\sum _{i=1}^{2}\left( b_{i}^{(k-1)}+g_{i}^{(k)}-d_{i}^{(k)}\right) . \end{aligned}$$
(18)

4 Energy Trading Based on Hotbooting Q-Learning

The repeated interactions among the N MGs in a smart grid can be formulated as a dynamic energy trading game. The amounts of energy that MG i trades with the power plant and the other MGs affect its future battery level and the future trading decisions of the other MGs, as shown in (2) and (4). Thus the next state observed by the MG depends on the current energy trading decision, indicating a Markov decision process. Therefore, an MG can use Q-learning to derive the optimal trading strategy without knowing the other MGs' battery levels and energy demand models in the dynamic game. More specifically, the amount of energy that MG i intends to sell or purchase in the smart grid at time k, i.e., \(\varvec{x_{i}^{(k)}}\), is chosen based on its quality function or Q-function, denoted by \(Q_{i}(\cdot )\), which describes the expected discounted long-term reward for each state-action pair. The state observed by MG i at time slot k, denoted by \(\varvec{s}_i^{(k)}\), consists of the current local energy demand, the estimated amount of renewable energy generated at time k, and the previous battery level of the MG, i.e., \(\varvec{s}_i^{(k)}= \left[ {d_i^{(k)}}, {\hat{g}_i^{(k)}}, {b_i^{(k-1)}}\right] \).

The value function \(V_i\left( \varvec{s}\right) \) is the maximum of the Q-function over the feasible actions at state \(\varvec{s}\). The Q-function and the value function of MG i are updated, respectively, by

$$\begin{aligned}&Q_i\left( \varvec{s}_i^{(k)},\varvec{x}_i^{(k)}\right) \leftarrow (1-\alpha )Q_i\left( \varvec{s}_i^{(k)},\varvec{x}_i^{(k)}\right) + \alpha \left( u_i^{(k)} + \gamma V_i\left( {\varvec{s}_i^{(k+1)}}\right) \right) \end{aligned}$$
(19)
$$\begin{aligned}&V_i\left( {\varvec{s}_i^{(k)}}\right) = \mathop {\max }\limits _{\varvec{x} \in \varvec{X}} Q_i\left( \varvec{s}_i^{(k)},\varvec{x}\right) , \end{aligned}$$
(20)

where \(\alpha \in (0, 1] \) is the learning rate representing the weight of current experience in the learning process, and the discount factor \(\gamma \in [0, 1] \) indicates the uncertainty of the microgrid regarding the future utility.
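
A minimal tabular realization of the updates (19) and (20), assuming the state \(\varvec{s}_i^{(k)}\) and the feasible trades \(\varvec{X}\) have been discretized into hashable tuples; the class and parameter names are our own illustrative assumptions.

```python
from collections import defaultdict

class QTable:
    """Tabular Q-function of one MG, updated as in (19)-(20)."""

    def __init__(self, actions, alpha=0.5, gamma=0.9):
        self.q = defaultdict(float)   # (state, action) -> Q value, initialized to zero
        self.actions = actions        # discretized feasible trade vectors X
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor

    def value(self, state):
        """V(s) = max_x Q(s, x), as in (20)."""
        return max(self.q[(state, a)] for a in self.actions)

    def update(self, state, action, reward, next_state):
        """Q(s, x) <- (1 - alpha) Q(s, x) + alpha (u + gamma V(s')), as in (19)."""
        target = reward + self.gamma * self.value(next_state)
        self.q[(state, action)] = (1 - self.alpha) * self.q[(state, action)] + self.alpha * target
```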

The standard Q-learning algorithm initializes the Q-function with an all-zero matrix, which is usually far from the optimal value and thus degrades the learning performance at the beginning. Therefore, we design a hotbooting technique that initializes the Q-values with training data obtained from large-scale experiments performed in similar smart grid scenarios. This avoids the random explorations at the beginning of the game and thus accelerates convergence. More specifically, we perform I similar energy trading experiments before the start of the game, as shown in Algorithm 1.

Algorithm 1 (hotbooting initialization) and Algorithm 2 (hotbooting Q-learning based energy trading) are given as figures in the original.
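
Since Algorithm 1 is only available as a figure, the sketch below shows one plausible way to realize the hotbooting idea described above: replay I offline trading experiments generated from similar smart grid scenarios and reuse the resulting Q-table to initialize the online game. The `simulate_episode` interface is an assumption.

```python
def hotboot(q_table, simulate_episode, num_experiments):
    """Hotbooting in the spirit of Algorithm 1: warm up the Q-table with
    num_experiments offline energy trading experiments so that the online game
    does not start from an all-zero Q-function."""
    for _ in range(num_experiments):
        # simulate_episode() yields (state, action, reward, next_state) transitions
        # generated in a similar smart grid scenario.
        for state, action, reward, next_state in simulate_episode():
            q_table.update(state, action, reward, next_state)  # update rule (19)-(20)
    return q_table
```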

To balance exploitation and exploration in the learning process, an \(\epsilon \)-greedy policy is applied to choose the amount of energy to trade with the other MGs and the power plant, i.e., \(\varvec{x}^{(k)}_{i}\) is given by

$$\begin{aligned} \text {Pr}(\varvec{x}^{(k)}_{i} = \varvec{\Theta }) = {\left\{ \begin{array}{ll} 1 - \epsilon , &{} \varvec{\Theta } = \arg \mathop {\max }\limits _{\varvec{x} \in \varvec{X}} Q_i\left( \varvec{{s}_i^{(k)}}, \varvec{x}\right) \\ { \epsilon \over {|\varvec{X}|}}, &{} \text {o.w.} \end{array}\right. } \end{aligned}$$
(21)

MG i chooses \(\varvec{x}_i^{(k)}\) according to the \(\epsilon \)-greedy strategy and negotiates with the other MGs to determine the actual amounts of traded energy \(\varvec{y}_{i}^{(k)}\) according to (2). As shown in Algorithm 2, the MG then observes the resulting utility \(u_i^{(k)}\) and the next state, and updates its Q-function via (19) and (20).
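
The online loop summarized in Algorithm 2 can be sketched as follows: in each time slot the MG picks its intended trades with an \(\epsilon \)-greedy rule in the spirit of (21), lets the environment carry out the bargaining (2)-(3) and the battery update (4), observes the utility (6) and the next state, and updates the Q-table via (19) and (20). The `env.trade` interface and all names are our own assumptions.

```python
import random

def epsilon_greedy(q_table, state, epsilon=0.1):
    """Choose an intended trade vector: greedy with probability 1 - epsilon,
    uniformly random otherwise (a standard variant of (21))."""
    if random.random() < epsilon:
        return random.choice(q_table.actions)                          # explore
    return max(q_table.actions, key=lambda a: q_table.q[(state, a)])   # exploit

def trading_step(q_table, env, state, epsilon=0.1):
    """One time slot of the hotbooting Q-learning based trading (Algorithm 2) for one MG."""
    action = epsilon_greedy(q_table, state, epsilon)
    # env.trade() bargains with the other MGs via (2)-(3), applies the battery update (4),
    # and returns the utility (6) and the next state [d, g_hat, b].
    reward, next_state = env.trade(action)
    q_table.update(state, action, reward, next_state)
    return next_state
```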

5 Simulation Results

Simulations have been performed to evaluate the performance of the hotbooting Q-learning based energy trading strategy in the dynamic game with \(N=2\) MGs. In the simulations, unless otherwise specified, the energy storage capacity of each MG is \(B=4\) and the energy gain coefficient is \(\beta =8\). The local energy demands, the energy trading prices, and the renewable energy generation models of each MG are taken from the energy data of microgrids in Hong Kong in [13]. As benchmarks, we consider the Q-learning based trading scheme and a greedy scheme, in which each MG chooses the amount of energy to sell or buy according to its current battery level so as to maximize its estimated immediate utility.

Fig. 1. Performance of the energy trading strategies in the dynamic game with \(N=2\), \(B=4\) and \(\beta =8\)

As shown in Fig. 1, the proposed Q-learning based energy trading strategy outperforms the greedy strategy, buying less energy from the power plant and achieving a higher MG utility. For example, at the 1500th time slot of the game, the Q-learning based strategy decreases the average amount of energy purchased from the power plant by 47.7% and increases the utility of the MG by 11.6% compared with the greedy strategy. The performance of the Q-learning based strategy is further improved by the hotbooting technique, which exploits similar energy trading experiences to accelerate learning. As shown in Fig. 1, the hotbooting Q-learning based energy trading strategy decreases the amount of energy purchased from the power plant by 33.7% and increases the utility of the MG by 9.5% compared with the Q-learning based strategy at the 1500th time slot.

6 Conclusion

In this paper, we have formulated an MG energy trading game for smart grids and derived the NE of the game, disclosing the conditions under which the MGs in a smart grid trade with each other and reduce their dependence on the power plant. A Q-learning based energy trading strategy has been proposed for each MG to choose the amounts of energy to trade with the other MGs and the power plant in the dynamic game with time-varying renewable energy generation and power demand. The learning speed is further improved by the hotbooting Q-learning technique. Simulation results show that the proposed hotbooting Q-learning based energy trading scheme improves the utility of the MG and reduces the amount of energy purchased from the power plant, compared with the benchmark strategies.