
1 Introduction

Together with the adoption of smarter energy grids comes the idea of deregulating energy supply and demand through energy markets, where producers can sell energy to consumers using a broker as an intermediary. One of the most dominant energy markets is the tariff market, where small consumers can buy energy from broker agents via tariffs. Tariffs are contracts agreed between either a producer or a consumer and a broker, which entitle both parties to trade a certain amount of energy under certain conditions [1]. These conditions may include the payment per amount of energy traded, a minimum signup time, and signup or early-withdrawal payments, among others [2]. The vast majority of traded energy flows through open energy markets of this kind, which use tariffs to buy and sell energy. For this reason, this work focuses on proposing a tariff-expert broker agent for tariff energy markets. We use Power TAC [3], an annual international trading agent competition that gathers experts from different fields and regions, to validate our proposed broker. Power TAC is a complex simulator of an entire energy grid with producers, consumers and brokers buying and supplying energy. It considers transmission and distribution costs, models many different types of energy generation and storage capacities, and uses real climate conditions and user preferences to simulate the environment in which brokers must make autonomous decisions.

Several aspects, including the customers’ preferences and the competitors’ offers, were taken into account to design our tariff-expert broker [4], which uses reinforcement learning to generate electric energy tariffs while striving to maximize its utility in the long term. To test our proposed tariff design, we embedded our solution in COLD Energy, a broker agent that also handles many other aspects of the smart grid (such as the wholesale day-ahead and spot markets, balancing issues and portfolio management). However, this paper focuses solely on the tariff-maker part of COLD Energy.

The paper is structured as follows. In Sect. 2 we present a general background on Power TAC and the electricity tariff markets, followed by the work most relevant to ours. In Sect. 3 we present our tariff-expert contribution embedded in COLD Energy. We present our experimental results in Sect. 4 and close our work with some relevant conclusions.

Fig. 1. PowerTAC timeslot cycle including the tariff market operations

2 Power TAC and Tariff Markets

Power TAC [3] is a smart grid [5] simulation platform where a set of brokers compete against each other in an energy market. Power TAC uses a multi-agent approach [6] to simulate a smart grid market in which brokers can buy or sell energy in two different markets: the wholesale market and the tariff market. This paper focuses solely on the tariff market. In the tariff market, brokers trade energy with their clients through contracts called tariffs, which include specifications such as subscription or early-withdrawal fees, periodic payments and, most importantly, the price per kWh. The experiments in this paper used a particular type of tariff called a flat tariff [7]. A flat tariff is time-independent: it offers a fixed price per energy unit regardless of the time of day or the day of the week, so its only specification is the price per kWh. Figure 1 shows the Power TAC cycle, including the tariff market period. During this period each broker publishes tariffs, and customers evaluate them and decide whether to subscribe. Later in this period the consumption and production operations related to tariffs are executed, and the transaction proceedings are charged either to the brokers or to the customers at the end of each time unit. The time unit used in Power TAC is the timeslot, which represents one simulated hour. Brokers can publish tariffs at any given timeslot. After a tariff is published, customers can evaluate the offers and decide whether to stay with their current tariff or switch to any available one, which may belong to the same broker or to another. The objective of every broker is to publish attractive tariffs, so that producing customers want to sell energy to it and consuming customers want to buy energy from it. At the end, each broker receives a utility that depends on its income, its expenses and the imbalance fees charged by the transmission line owner.
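As an illustration only, the following Python sketch shows one way a flat tariff could be represented in code; the class and field names are ours and do not correspond to Power TAC’s actual broker API.

```python
from dataclasses import dataclass

@dataclass
class FlatTariff:
    """Hypothetical flat-tariff record; not Power TAC's actual API."""
    broker: str
    price_per_kwh: float         # the only pricing parameter of a flat tariff
    is_production: bool = False  # True if the broker buys energy from producers

    def charge(self, kwh: float) -> float:
        # Amount the customer pays (consumption) or receives (production).
        return self.price_per_kwh * kwh

# Example: a flat consumption tariff at 0.40 per kWh; a customer
# consuming 12 kWh in one timeslot is charged 4.80.
tariff = FlatTariff(broker="COLD Energy", price_per_kwh=0.40)
print(tariff.charge(12.0))  # 4.8
```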

3 COLD Energy Tariff-Expert

The strategy proposed in this paper is based on the work of Reddy and Veloso [8]. In that work a simulation approach was used to investigate a heavily simplified competitive tariff market, where the amount of energy consumed and produced by customers was discretized in blocks, and the daily consumption was a fixed parameter that remained the same throughout the entire simulation. The paper used five agents, each equipped with a different decision-making mechanism and a different set of actions to alter tariff prices. One of these agents used a Markov Decision Process (MDP) to learn a policy via Q-learning. The states of the Q-learning algorithm consisted of two heuristic elements. The first captured the broker’s energy balance, determining whether more energy was bought than sold or the other way around. The second captured the state of the market by comparing the minimum consumption price and the maximum production price. The paper demonstrated that, in this simplified scenario, agents using the learning strategy outperformed those using a fixed strategy in terms of overall profit.

We tested their learning algorithm in a more complex fixed-tariff market scenario and developed a learning broker \(B_{L}\) that uses an improved market representation based on the one proposed by Reddy, along with a new set of actions, each of which publishes a consumption and a production price. In more detail, our learning broker evaluates how the last production and consumption prices behaved in terms of utility and then picks another action. Each action publishes a new consumption and production tariff with prices \(P_{t,C}^{B_{k}}\) and \(P_{t,P}^{B_{k}}\) respectively. At the end of the evaluation period, \(\varPsi _{t,C}\) and \(\varPsi _{t,P}\) represent the amount of energy sold and acquired by the broker, respectively. In general terms, the symbol P refers to an energy price and \(\varPsi \) to an energy amount. For each evaluation period, the utility function for broker k (\(B_{k}\)) is the one shown in Eq. 1. The first term represents the total income from energy sales, the second term corresponds to the amount paid to producers, and the third term represents an imbalance fee.

$$\begin{aligned} u_{t}^{ B_{k}}=P_{t,C}^{ B_{k}} \varPsi _{t,C}-P_{t,P}^{ B_{k}} \varPsi _{t,P}-\theta _{t}|\varPsi _{t,C}-\varPsi _{t,P}| \end{aligned}$$
(1)

Each term in Eq. 1 represents either a monetary income or expense, so the whole utility is a monetary amount: all three terms multiply a price per energy unit by an energy amount, yielding a monetary unit. If the difference \(\varPsi _{t,C}-\varPsi _{t,P}\) equals zero, the broker sold exactly the same amount of energy it bought, so the energy imbalance is zero and the imbalance fee is zero as well. The variable \(\theta _{t}\) is the amount the broker has to pay to the transmission line owner per unit of energy imbalance generated during the evaluation period.
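As a concrete illustration, the following sketch (with parameter names of our choosing) evaluates Eq. 1 directly:

```python
def broker_utility(p_cons: float, psi_cons: float,
                   p_prod: float, psi_prod: float,
                   theta: float) -> float:
    """Evaluation-period utility of a broker, following Eq. 1.

    p_cons, p_prod -- published consumption and production prices
    psi_cons       -- energy sold to consumers in the period
    psi_prod       -- energy bought from producers in the period
    theta          -- imbalance fee per unit of energy
    """
    income = p_cons * psi_cons            # proceeds from energy sales
    expenses = p_prod * psi_prod          # payments to producers
    imbalance_fee = theta * abs(psi_cons - psi_prod)
    return income - expenses - imbalance_fee

# A perfectly balanced portfolio pays no imbalance fee:
print(broker_utility(0.40, 100.0, 0.015, 100.0, theta=0.05))  # 38.5
```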

The utility function from Eq. 1 was used as the MDP’s reward after executing a certain action in a given state at time t. Our broker’s state representation is described in Sect. 3.3 and its actions in Sect. 3.4.

3.1 Market Model

It is important to mention first that the market model was designed to maximize utility in the long term. The environment description, encoded as discrete states, depends on some key elements of the tariffs published by the other brokers, namely the maximum and minimum consumption prices and the maximum and minimum production prices. These parameters are defined as follows.

Minimum consumption price:

$$\begin{aligned} P_{t,C}^{min} = min_{ B_{k} \in B \backslash \left\{ B_{L} \right\} } P_{t,C}^{B_k} \end{aligned}$$
(2)

Maximum consumption price:

$$\begin{aligned} P_{t,C}^{max} = max_{ B_{k} \in B \backslash \left\{ B_{L} \right\} } P_{t,C}^{B_k} \end{aligned}$$
(3)

Minimum production price:

$$\begin{aligned} P_{t,P}^{min} = min_{ B_{k} \in B \backslash \left\{ B_{L} \right\} } P_{t,P}^{B_k}, \end{aligned}$$
(4)

Maximum production price:

$$\begin{aligned} P_{t,P}^{max} = max_{ B_{k} \in B \backslash \left\{ B_{L} \right\} } P_{t,P}^{B_k}, \end{aligned}$$
(5)

where \(B_{L}\) represents the learning broker evaluating these parameters; the minimum and maximum prices are taken over the prices of all the other brokers, excluding those of the learning broker \(B_{L}\) itself. A minimal sketch of how these reference prices can be computed is shown below, after which we proceed to explain the MDP we used.
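This sketch computes the four reference prices of Eqs. 2–5, assuming tariffs arrive as (broker, price, is_production) tuples; this structure is ours, adopted for illustration only.

```python
def reference_prices(tariffs, learning_broker="B_L"):
    """Market reference prices of Eqs. 2-5, excluding the learning
    broker's own tariffs. `tariffs` is a list of
    (broker, price, is_production) tuples -- an assumed structure,
    not Power TAC's actual message format."""
    cons = [p for b, p, is_prod in tariffs
            if b != learning_broker and not is_prod]
    prod = [p for b, p, is_prod in tariffs
            if b != learning_broker and is_prod]
    return {"P_C_min": min(cons), "P_C_max": max(cons),
            "P_P_min": min(prod), "P_P_max": max(prod)}

# Example mirroring the scenario of Sect. 3.5:
tariffs = [("A", 0.50, False), ("B", 0.40, False),
           ("A", 0.015, True), ("B", 0.015, True)]
print(reference_prices(tariffs, learning_broker="COLD Energy"))
# {'P_C_min': 0.4, 'P_C_max': 0.5, 'P_P_min': 0.015, 'P_P_max': 0.015}
```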

3.2 MDP Description

The MDP used by COLD Energy is shown in Eq. 6.

$$\begin{aligned} M^{B_{L}}= \left\langle S,A,P,R \right\rangle \end{aligned}$$
(6)

where:

  • \(S =\left\{ {s_{i}:i=1,\ldots ,I}\right\} \) is a set of I states,

  • \(A =\left\{ {a_{j}:j=1,\ldots ,J}\right\} \) is a set of J actions,

  • \(P(s,a) \rightarrow s'\) is a transition function and

  • \(R(s,a)\) equals \(u_{t}^{ B_{k=L}}\) and represents the reward obtained for executing action a while in state s.

3.3 States

A series of states was designed to provide our learning broker with a discretized view of the market that also considers the effect of the actions executed by the other brokers. Specifically, the state space S is the set defined by the following tuple:

$$\begin{aligned} S= \left\langle PRS_{t},PS_{t},CPS_{t},PPS_{t} \right\rangle \end{aligned}$$
(7)

where:

  • \(PRS_{t} =\left\{ {rational,inverted}\right\} \) is the price range status at time t,

  • \(PS_{t} =\left\{ {shortsupply,balanced,oversupply}\right\} \) is the portfolio status at time t,

  • \(CPS_{t} =\left\{ {out,near,far,very far}\right\} \) is the consumers price status and

  • \(PPS_{t} =\left\{ {out,near,far,very far}\right\} \) is the producers price status.

The values \(PRS_{t}\) and \(PS_{t}\) capture the relationship between the highest production price and the lowest consumption price, and the energy balance of the broker \(B_{L}\), respectively. These two parameters were proposed by Reddy and are defined as follows:

$$\begin{aligned} PRS_{t} = \left\{ \begin{array}{ll} rational &{} \text{ if } P_{t,C}^{min} > P_{t,P}^{max} \\ inverted &{} \text{ if } P_{t,C}^{min} \le P_{t,P}^{max} \end{array} \right. \end{aligned}$$
(8)
$$\begin{aligned} PS_{t} = \left\{ \begin{array}{ll} balanced &{} \text{ if } \varPsi _{t,C} = \varPsi _{t,P} \\ shortsupply &{} \text{ if } \varPsi _{t,C} > \varPsi _{t,P} \\ oversupply &{} \text{ if } \varPsi _{t,C} < \varPsi _{t,P} \end{array} \right. \end{aligned}$$
(9)

where \(P_{t,C}^{min}\), \(P_{t,C}^{max}\), \(P_{t,P}^{min}\) and \(P_{t,P}^{max}\) are the minimum and maximum consumption and production prices defined in Eqs. 2–5, which consider the prices of all the other brokers but exclude those of the learning broker \(B_{L}\).

These two elements of S encode the broker’s price actions relative to the prices of the other brokers. Coarse as they are, these parameters create a compact representation of a market that might include several brokers publishing many tariffs: its size remains unchanged regardless of these factors, while still capturing the tariff market price state as a whole, including the competing brokers’ tariff publications. The tuple parameters \(CPS_{t}\) and \(PPS_{t}\) can take any of the values out, near, far and very far, and are defined as follows (a code sketch of the full state discretization follows the definitions).

$$\begin{aligned} CPS_{t} = \left\{ \begin{array}{ll} out &{} \scriptstyle \text{ if } Top_{ref} \le P_{t-1,C}^{B_{L}} \\ near &{} \scriptstyle \text{ if } Thres_{ref}< P_{t-1,C}^{B_{L}} < Top_{ref}\\ far &{} \scriptstyle \text{ if } Middle_{ref} < P_{t-1,C}^{B_{L}} \le Thres _{ref}\\ very far &{} \scriptstyle \text{ if } P_{t-1,C}^{B_{L}} \le Middle_{ref} \end{array} \right. \end{aligned}$$
(10)

where:

  • \(Top_{ref} = P_{t,C}^{min}\)

  • \(Middle_{ref} = \frac{P_{t,C}^{min} + P_{t,P}^{min}}{2} \)

  • \(Thres_{ref} = \frac{Top_{ref} + Middle_{ref}}{2} \)

$$\begin{aligned} PPS_{t} = \left\{ \begin{array}{ll} out &{} \scriptstyle \text{ if } Bottom_{ref} \ge P_{t-1,P}^{B_{L}} \\ near &{} \scriptstyle \text{ if } Thres_{ref} \ge P_{t-1,P}^{B_{L}}> Bottom_{ref}\\ far &{} \scriptstyle \text{ if } Middle_{ref} \ge P_{t-1,P}^{B_{L}} > Thres_{ref}\\ very far &{} \scriptstyle \text{ if } P_{t-1,P}^{B_{L}} > Middle_{ref} \end{array} \right. \end{aligned}$$
(11)

where:

  • \(Bottom_{ref} = P_{t,P}^{min}\),

  • \(Middle_{ref} = \frac{P_{t,C}^{min} + P_{t,P}^{min}}{2} \)

  • \(Thres_{ref} = \frac{Bottom_{ref} + Middle_{ref}}{2} \)
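Putting Eqs. 7–11 together, the following sketch shows one possible discretization routine under our reading of the definitions above; note that the state space contains 2 × 3 × 4 × 4 = 96 states.

```python
def discretize_state(p_c_min, p_p_min, p_p_max,
                     psi_c, psi_p, prev_cons, prev_prod):
    """Map market observations to the state tuple of Eq. 7 -- a sketch
    under our reading of Eqs. 8-11. `prev_cons`/`prev_prod` are the
    learning broker's previously published prices."""
    # Eq. 8: price range status
    prs = "rational" if p_c_min > p_p_max else "inverted"

    # Eq. 9: portfolio status
    if psi_c == psi_p:
        ps = "balanced"
    elif psi_c > psi_p:
        ps = "shortsupply"
    else:
        ps = "oversupply"

    middle = (p_c_min + p_p_min) / 2.0

    # Eq. 10: consumers price status (references: top / thres / middle)
    thres_c = (p_c_min + middle) / 2.0
    if prev_cons >= p_c_min:
        cps = "out"
    elif prev_cons > thres_c:
        cps = "near"
    elif prev_cons > middle:
        cps = "far"
    else:
        cps = "very far"

    # Eq. 11: producers price status (references: bottom / thres / middle)
    thres_p = (p_p_min + middle) / 2.0
    if prev_prod <= p_p_min:
        pps = "out"
    elif prev_prod <= thres_p:
        pps = "near"
    elif prev_prod <= middle:
        pps = "far"
    else:
        pps = "very far"

    return (prs, ps, cps, pps)
```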

3.4 Actions

The set of actions is defined as:

$$\begin{aligned} A =\left\{ \scriptstyle {maintain,lower,raise,inline,revert,minmax,wide,bottom}\right\} \end{aligned}$$
(12)

Each of these actions defines how the learning agent \(B_{L}\) determines the prices \(P_{t+1,C}^{B_{L}}\) and \(P_{t+1,P}^{B_{L}}\) for the next timeslot t\(+\)1. The actions were designed to give the broker several ways to react quickly to market changes. It is important to recall that every action affects both the production and consumption price features of the next tariffs to be published. A code sketch of some of these actions follows the list. These are the specific details of each action:

  • maintain publishes the same prices as in the previous timeslot.

  • lower decreases both consumer and producer prices by a fixed amount.

  • raise increases both the consumer and producer prices by a fixed amount.

  • inline sets the consumption and production prices as \(P_{t+1,C}^{B_{L}}=\left\lceil {m_{p}+\frac{\mu }{2}}\right\rceil \) and \(P_{t+1,P}^{B_{L}}=\left\lfloor {m_{p}-\frac{\mu }{2}}\right\rfloor \), where \(m_{p}\) is the midpoint defined for the revert action below.

  • revert moves the consumption and production prices towards the midpoint \(m_{p}=\left\lfloor {\frac{1}{2}(P_{t,C}^{min}+P_{t,P}^{min})}\right\rfloor \).

  • minmax sets the consumption and production prices as \(P_{t+1,C}^{B_{L}}=D_{coeff}P_{t,C}^{max}\) and \(P_{t+1,P}^{B_{L}}=P_{t,P}^{min}\), where \(D_{coeff}\) is a number on the interval [0.70, 1.00] which damps the effect of the minmax action over the consumption price.

  • wide increases the consumption price by a fixed amount \(\varepsilon \) and decreases the production price by a fixed amount \(\varepsilon \).

  • bottom sets the consumption price as \(P_{t+1,C}^{B_{L}}=P_{t,C}^{min} \cdot Margin\) and the production price as \(P_{t+1,P}^{B_{L}}=P_{t,P}^{min}\). The bottom action is market-bounded.
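As referenced above, here is a minimal sketch of how a few of these actions could be implemented, with the market bounding described in Sect. 3.5 applied at the end. The parameter values (delta, d_coeff, margin) and the halving step in revert are our assumptions, not values from the paper.

```python
def clamp(price, low, high):
    """Market bounding: keep a price inside [P_P_max, P_C_min]."""
    return max(low, min(high, price))

def apply_action(action, ref, cons, prod,
                 delta=0.01, d_coeff=0.85, margin=0.95):
    """Illustrative sketch of some of the actions in Eq. 12. `ref`
    holds the reference prices of Eqs. 2-5 (as returned by
    reference_prices above); `cons`/`prod` are the previously
    published consumption and production prices."""
    mid = (ref["P_C_min"] + ref["P_P_min"]) / 2.0
    if action == "lower":
        cons, prod = cons - delta, prod - delta
    elif action == "raise":
        cons, prod = cons + delta, prod + delta
    elif action == "revert":
        # move both prices toward the midpoint (halving the gap is
        # our assumption about the step size)
        cons, prod = cons - (cons - mid) / 2.0, prod + (mid - prod) / 2.0
    elif action == "minmax":
        cons, prod = d_coeff * ref["P_C_max"], ref["P_P_min"]
    elif action == "bottom":
        cons, prod = ref["P_C_min"] * margin, ref["P_P_min"]
    # "maintain" leaves the prices unchanged; inline and wide are analogous
    low, high = ref["P_P_max"], ref["P_C_min"]
    return clamp(cons, low, high), clamp(prod, low, high)
```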

Fig. 2. Overall average and standard deviation for each broker

3.5 State/Action Flow Example

To illustrate the effect of actions on the consumption and production prices, Fig. 2 shows a simple simulated sequence of actions. The actions appear above the graph. In this hand-crafted scenario COLD Energy competes against two brokers, each of which publishes one consumption and one production tariff. The horizontal axis represents time measured in decision steps; the vertical axis corresponds to the energy price. The dashed lines are fixed references, while the continuous lines are the published prices, as described below:

  • maxCons: corresponds to \(P_{t,C}^{max}\) and is equal to 0.5. It can be assumed that competing broker A published a consumption tariff with this price.

  • minCons: corresponds to \(P_{t,C}^{min}\) and is equal to 0.4. It can be assumed that competing broker B published a consumption tariff with this price.

  • minProd: corresponds to both \(P_{t,P}^{min}\) and \(P_{t,P}^{max}\), meaning the maximum and minimum production prices coincide at 0.015. It can be assumed that both brokers A and B published a production tariff with this price.

  • Cons: corresponds to the consumption price published by COLD Energy.

  • Prod: corresponds to the production price published by COLD Energy.

COLD Energy bounds the prices of its tariffs to the range [\(P_{t,P}^{max}\), \(P_{t,C}^{min}\)], so none of the actions can place a price outside this range. This feature ensures that any consumption price published by COLD Energy will be more attractive to energy buyers, and any production price published will be more attractive to energy sellers, than those of its competitors.

The learning algorithm used was the Watkins-Dayan [9] Q-learning update rule with an \(\varepsilon \)-greedy exploration strategy. This strategy selects a random action with probability \(\varepsilon \) and, with probability \(1-\varepsilon \), the action with the highest estimated value in the given state:

$$\begin{aligned} \hat{Q}_{t}(s,a)\leftarrow (1 - \alpha _{t})\hat{Q}_{t-1}(s,a)+\alpha _{t}\left[ r_{t}+\gamma \max _{a'}\hat{Q}_{t-1} (s',a')\right] \end{aligned}$$
(13)
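A compact sketch of the update rule of Eq. 13 together with the \(\varepsilon \)-greedy selection strategy; using constant learning-rate and discount values is our simplification of \(\alpha _{t}\) and \(\gamma \).

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q-table keyed by (state, action) pairs

def choose_action(state, actions, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Watkins-Dayan Q-learning update of Eq. 13."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```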
Table 1 Competing brokers general description

4 Experimental Results

This section describes the results obtained using the market representation and the actions described in the previous section. Six different brokers participated in the series of experiments, including COLD Energy and ReddyLearning. The brokers are described in Table 1.

Since COLD Energy deals with flat tariffs, it must be tested against brokers that do likewise; this is why ReddyLearning was chosen, and the same logic applies to the remaining brokers. It would not be possible to isolate the effect of the pricing strategy if the competing brokers used different tariff creation mechanisms or published tariffs other than flat ones. These two considerations matter because Power TAC allows brokers to publish time-dependent tariffs and also supplies wholesale market capabilities to every broker.

4.1 General Setup

Prior to the experiments, both COLD Energy and ReddyLearning were trained against a fixed broker for 2,000 timeslots and against the Random broker for 8,000 timeslots. During the training sessions the brokers were set to explore at every decision step, updating their Q-tables with the obtained rewards. The trained Q-table was stored and transferred to the brokers to be exploited in the experiments. The general experimental setup uses a game length of 3,000 timeslots and a tariff publication interval of 50 ± 5 timeslots, at which one consumption and one production tariff are published. Lastly, since the training process took place before the experimental sessions, the learning brokers did not explore at all during the test sessions. A sketch of this training schedule is shown below.
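This hypothetical outline mirrors the schedule just described, reusing the choose_action and q_update sketches above; initial_state and simulate_timeslot are placeholders for the actual Power TAC interaction.

```python
import random

ACTIONS = ["maintain", "lower", "raise", "inline",
           "revert", "minmax", "wide", "bottom"]

def initial_state():
    # Placeholder: a real broker would observe the market here.
    return ("rational", "balanced", "near", "near")

def simulate_timeslot(state, action, opponent):
    # Placeholder for the Power TAC interaction: publish tariffs,
    # collect the Eq. 1 utility as reward, observe the next state.
    return random.random(), initial_state()

def train(opponent, n_timeslots, epsilon):
    state = initial_state()
    for _ in range(n_timeslots):
        action = choose_action(state, ACTIONS, epsilon)
        reward, next_state = simulate_timeslot(state, action, opponent)
        q_update(state, action, reward, next_state, ACTIONS)
        state = next_state

train("Fixed", 2000, epsilon=1.0)   # explore at every decision step
train("Random", 8000, epsilon=1.0)
# test sessions then run with epsilon = 0.0 (exploitation only)
```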

4.2 Experiments Description

The experiments were designed to test COLD Energy against specific sets of the competing brokers and itself. We conducted the following set of experiments.

  • COLD Energy versus All: our learning broker versus Random, Balanced, Greedy and the learning broker proposed by Reddy, named ReddyLearning.

  • COLD Energy versus ReddyLearning: our learning broker versus the learning broker proposed by Reddy.

4.3 COLD Energy Versus All

This series of experiments included all the brokers. Figure 3 plots the average and standard deviation per publication interval for each broker, while Fig. 4 shows an example of how the accumulated utility behaved in one of the experiments.

Fig. 3. Overall average and standard deviation for each broker

Fig. 4. Accumulated utility for each broker

Several observations can be drawn from these results. First, Fig. 3 clearly shows that COLD Energy obtains the highest utility among the competing brokers. The second position goes to the Random broker and the third to ReddyLearning. The latter uses the market representation and set of actions proposed by Reddy [8], which differ from those used by COLD Energy and Random. Random, on the other hand, shares the same set of actions and the same market representation as COLD Energy; for this reason Random sometimes obtains a better utility than COLD Energy, when it reacts after COLD Energy has published its tariffs. This fact highlights the importance of the proposed representation. It is also worth noting that COLD Energy’s actions are market-bounded, which means that the resulting prices are competitive, so customers have a higher probability of subscribing to them.

Finally, Table 2 provides more insight into the brokers’ behavior. The first column lists the states as described by Eq. 7; the abbreviations are explained in the Appendix. The following columns show the average utility and standard deviation obtained in each of the states in column one. Observing Table 2, we notice first of all that for COLD Energy, even though the overall standard deviation is high compared to the overall average (shown in the last row), there are states with higher averages and lower standard deviations than those of the other brokers. The states with the largest average rewards are those where \(PRS_{t}\) equals rational and \(CPS_{t}\) equals far or very far. These two values of \(CPS_{t}\) are associated with the inline and bottom actions, which safely place the consumption price away from the competitors’, making the published tariff attractive to customers. These states also have some of the lowest standard deviations, which tells us they are consistently desirable states.

Table 2 Average and standard deviation per state and broker

4.4 COLD Energy Versus ReddyLearning

This section presents the performance of COLD Energy when tested against its direct competitor, ReddyLearning, alone. Figure 5 shows the average utility and standard deviation for this experiment.

Figure 5, which shows the average and standard deviation per publication interval for both ReddyLearning and COLD Energy, makes it evident that COLD Energy achieves better results than ReddyLearning, with a very small standard deviation. The average utility in this experiment is higher than in Fig. 3 because there are fewer brokers and, therefore, more customers available for each one.

5 Conclusions

The experiments showed that COLD Energy, with its proposed set of actions and its market representation, was able to obtain the highest profit in 70% of the evaluated timeslots when tested against all the competing brokers, including ReddyLearning. When tested against the latter alone, COLD Energy obtained the highest profit in 100% of the evaluated timeslots. This shows that both the market representation and the proposed actions achieved a better average utility than that delivered by the other competing brokers against which it was tested, namely ReddyLearning, Balanced, Greedy and Random.

Fig. 5. Overall average and standard deviation for COLD Energy and ReddyLearning

It is also important to mention that the size of the market representation is not tied to the number of competing brokers: the number of possible value combinations of the state space S remains the same whether there are one, two or more competitors. This is very useful because it simplifies the learning process. On the other hand, the market-bounded actions proposed were the ones COLD Energy used most, and these actions led it to top the utility ranking most of the time in the executed experiments. Even though some non-market-bounded actions were available, such as minmax, COLD Energy learned that those actions did not yield good results and therefore chose not to use them.