
1 Introduction

In 2023, wind and solar energy represented 14.26% of global electricity generation, after this combined share doubled over the preceding five years [42]. The power of wind and sunlight reaching the Earth’s surface is, to some extent, random. Therefore, while the rise of renewable energy sources promises cheap and clean energy, it also exacerbates the problem of balancing power supply and demand.

In many countries, the main institution that balances volatile electricity supply and demand is a day-ahead energy market [13, 14, 27, 30]. Every day, agents participating in this market place their buy and sell bids separately for every hour between 0 am and 11 pm of the next day. Market clearing prices are then determined for each of these hours, and the bids are consequently executed or not, depending on the proposed prices.

Here, we consider an agent that (i) consumes electricity, (ii) produces electricity, and (iii) has electricity storage. What is of main interest here is a strategy for automated trading on a day-ahead energy market on behalf of this agent.

Reinforcement learning (RL) [32] is a natural tool to optimize a policy of sequential decision-making in dynamical, stochastic systems that elude modeling. RL has been applied to optimize strategies of on-line energy trading within local energy markets [4, 15, 20, 21, 24, 28], real-time bidding for internet ads [6], stock market trading [10, 18, 38, 41], power grid control [1, 14, 25], and trading on the day-ahead energy market [8, 9].

In existing studies on RL for automated trading, an action either selects a bid from a predefined set or directly defines parameters (type, price, and quantity) of a single bid or a pair (sell and buy) of bids.

The fact that the agent can submit only one or two bids per bidding is a serious limitation. Most electronic markets allow their participants to define many bids for each time interval. By submitting a collection of bids, a participant can define how much of the commodity it wishes to sell and/or buy depending on the market price. Actual trading agents usually take advantage of this possibility, since buying more when the price is low and selling more when the price is high follows from economic rationality.

In this paper, we design a strategy that translates the information available to the trading agent into parameters of supply and demand curves. These parameters are then translated into a collection of bids of variable size. This strategy enables the trading agent to behave rationally in the economic sense, which is not possible when the strategy only produces single bids. We have designed our strategy with the day-ahead electricity market in mind; however, it can also be applied to other electronic markets.

In this paper, we demonstrate the performance of the proposed automated trading strategy in several scenarios of day-ahead electricity market trading based on real data. The strategy is currently being deployed in a real system for energy storage management.

The paper contributes as follows:

  • We design a parametric automated trading strategy suitable for electronic markets with significant lags between bidding and its corresponding transaction. This strategy produces supply and demand curves by means of bid collections of variable sizes, thereby enabling the trading agent to behave rationally.

  • We formalize a framework in which on-line RL can be applied to optimize a policy on the basis of recorded observations of the external environment without data on earlier decision-making.

  • We apply reinforcement learning to optimize the above strategy and select the best algorithm for this purpose. The resulting strategy is fitted to the data and ready to use in real life.

2 Related Work

Automated Trading on the Electricity Market. Research on automated trading on the electricity market covers various approaches. Some works introduce theoretical frameworks of bidding strategies [5, 17, 36]. Many authors propose various forms of parametric bidding strategies, optimized with methods like linear programming [3], genetic and evolutionary algorithms [2, 37], or stochastic optimization [13, 19]. However, as the bidding strategy becomes more complex and a more complex transformation of observations into bids is required, these techniques become less effective.

With the advent of electricity prosumers, energy microgrids, energy cooperatives, and flexible price-driven energy consumption, there is an increasing need for automated decision-making and control in various activities undertaken by energy market participants. Strategies for these agents can be optimized with reinforcement learning. Various applications of RL in power systems are reviewed in [14, 26, 39]. The authors of [23] analyze bidding on a DA energy market as a zero-sum stochastic game played by energy producers willing to exercise their market power and keep their generators productive; RL is used there to optimize their bidding strategies. In [35], bidding on a DA energy market is analyzed from the point of view of a flexible buyer (who charges a fleet of electric vehicles), whose strategy is optimized with RL. A number of papers are devoted to peer-to-peer electricity trading on local, event-driven energy markets, with RL applied to optimize the behavior of such peers [4, 7, 8, 15, 28]. RL and neural price predictions are used in [20] to optimize the scheduling of home appliances of private users; the authors assume that the electricity prices are changing and are known one hour ahead. The work [4] analyzes a similar setting in which the users also trade energy with each other. This setting is used in [28] to optimize the user strategies with multi-agent RL. The authors of [21] optimize peer-to-peer energy microgrid operations with multi-agent reinforcement learning, with their method generating higher net profits than simple fixed-price bidding. Q-Learning and SARSA are used in [24] to create simple bidding strategies and test them on German real-life data.

The authors of [9] consider simultaneous trading on the DA and hour-ahead energy markets by an energy storage operator, modeled as a Markov Decision Process (MDP). They use RL to optimize a strategy of bidding on a DA energy market by a battery energy storage system (BESS). However, the authors address the dynamics of that process only to a limited extent. Consecutive days are treated as separate episodes, so the between-day dynamics of the market are not accounted for. Discrete actions define the parameters of the bids and are not based on external observations such as weather forecasts. Also, only a single bid can be placed for each hour. In the current paper, we address all of these limitations, which leads to significantly better performance of our proposed strategy and allows it to be deployed in real-life scenarios.

Automated Stock Market Trading. In this area, the trading agent observes a set of time series of prices of different assets. The agent makes on-line decisions on buying these assets at the current prices in anticipation of their price increase or selling them in anticipation of their price decrease. The problem is formalized as an MDP and addressed with RL [10, 40].

Additional related works are discussed in Appendix A of the supplementary material.

3 Problem Definition

In this paper, we consider automated trading on commodity markets with lags between biddings and their corresponding transactions. We specifically focus on the day-ahead energy market, understanding that other commodity markets could be approached similarly, with minor variations.

3.1 Day-Ahead Electricity Market

A trading agent is an entity such as a small- or medium-sized consumer of electricity, e.g., a group of households jointly connected to the power network. We assume that it may consume electricity randomly, produce electricity with weather-dependent sources such as solar panels and windmills, and store energy in batteries.

The trading agent participates in the day-ahead energy market. Every day before 10:30 am, the agent submits bids for 24 separate biddings: for hours 0 am, 1 am, ..., 11 pm of the following day. Each bid is defined by the hour, type (sell/buy), price (per 1 kWh), and quantity (in kWh). Any number of bids for each hour is acceptable. Right after the biddings close at 10:30 am, market prices are determined for each hour. Buy bids with prices higher than or equal to the market price are executed at the market price. Likewise, sell bids with prices lower than or equal to the market price are executed at the market price. On the next day, at each hour, the agent consumes, produces, and transmits energy to/from the power network according to its executed bids. The net energy is transmitted to or released from the energy storage. When the agent tries to draw energy from empty storage or put energy into full storage, it actually exchanges it with the market and pays a special fine for that.
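To make the clearing rule concrete, the sketch below (with hypothetical names and a simplified settlement that ignores fees) computes the net energy the agent trades at a given hour from its submitted bids and the announced market price.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bid:
    hour: int        # delivery hour of the following day, 0..23
    kind: str        # "sell" or "buy"
    price: float     # limit price per 1 kWh
    quantity: float  # kWh

def net_traded_energy(bids: List[Bid], market_price: float, hour: int) -> float:
    """Net energy (kWh) traded at `hour`: positive means sold, negative bought.

    Sell bids execute when their limit price <= market price; buy bids execute
    when their limit price >= market price; all executions settle at the
    market price.
    """
    sold = sum(b.quantity for b in bids
               if b.hour == hour and b.kind == "sell" and b.price <= market_price)
    bought = sum(b.quantity for b in bids
                 if b.hour == hour and b.kind == "buy" and b.price >= market_price)
    return sold - bought
```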

The problem is to designate the bids on behalf of the trading agent to maximize the profit gained (or minimize the cost incurred) from participation in the market.

3.2 Reinforcement Learning to Bid

We adopt the general framework of reinforcement learning [32]. The objective is to optimize a policy that translates the relevant available information into bids. This information defines the state of the environment. It is relevant to future market prices, e.g., weather forecasts or the day of the week, as well as to the current situation of the trading agent and its potential to produce and consume energy, e.g., the battery charge and, again, weather forecasts.

Every day, the trading agent receives a reward equal to the financial net result of its bids (and fines). The goal is to optimize the policy to yield the largest possible sum of future discounted rewards in each environmental state the trading agent encounters.

4 Method

4.1 Analysis

Within traditional microeconomics, we analyze the relation between the quantity of goods the agent sells or buys and the unit price of these goods. If the agent can only express its offered supply and demand in a pair of bids, it either sells/buys its defined quantity or not, depending on whether the market price is above/below its defined threshold. The supply/demand curves that visualize these relations can be seen in the top part of Fig. 1. To the best of our knowledge, only placing a single bid, or a sell-and-buy pair of bids, at a time has been considered in the automated trading literature.

However, it is a well-known tenet of microeconomics [16] that a rational economic agent is usually willing to sell a larger quantity of a commodity when its market price is higher, and to buy a larger quantity when the market price is lower. For our trading agent, both cases create a lucrative opportunity to sell high and buy low. These typical preferences are depicted in the middle part of Fig. 1 in the form of an increasing supply curve and a decreasing demand curve. How can the trading agent express such preferences with bids?

Fig. 1. Top: Supply and demand defined by a pair of bids; the agent sells \(q_s\) units at the unit price of \(p_m\). Middle: Nondecreasing supply and nonincreasing demand. Bottom: Nondecreasing supply and nonincreasing demand as defined by a collection of bids.

4.2 Price-Dependent Supply and Demand in Bids

Let us consider, for a given hour h, a collection of sell bids

$$\begin{aligned} \langle \text {sell}, h, p_s^{h,i}, q_0 \rangle , \quad i=1,\dots ,n^h_s, \quad p_s^{h,i}\le p_s^{h,i+1}, \end{aligned}$$
(1)

where \(q_0>0\) is a certain constant quantity, \(n^h_s\) is the number of bids, and \(p_s^{h,i}\) are unit prices. Let \(p^h_m\) be a market price, and integer j be such that

$$\begin{aligned} p_s^{h,j} \le p^h_m < p_s^{h,j+1}. \end{aligned}$$
(2)

Then, only the first j bids are executed, and the bidding agent sells the quantity \(jq_0\) at the market price \(p^h_m\). The above collection of bids (1) can thus be represented as a nondecreasing supply curve, similar to the one depicted in the bottom-left part of Fig. 1.
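As an illustration, a minimal sketch (with a hypothetical function name) of how the executed quantity \(jq_0\) follows from the market price and the sorted bid prices in (1):

```python
import bisect

def sold_quantity(sell_prices, q0, market_price):
    """Quantity sold from the collection (1): `sell_prices` is nondecreasing,
    each bid offers q0 kWh; the first j bids with price <= market price execute."""
    j = bisect.bisect_right(sell_prices, market_price)  # largest j satisfying (2)
    return j * q0

print(sold_quantity([0.10, 0.12, 0.15, 0.20], q0=1.0, market_price=0.14))  # -> 2.0
```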

Any nondecreasing function can be approximated by a piecewise constant step function. Consequently, any reasonable selling preferences can be approximately represented by a collection of bids of the form (1). Moreover, for technical reasons, in most electronic markets quantities in bids can only be defined as integers (or as integer multiples of the minimum tradable quantity). Consequently, any supply curve feasible in an electronic market is a piecewise constant step function, and it can be represented in the form (1).

The above reasoning can be repeated, with similar conclusions, for demand, which can effectively be represented as a collection of bids in the form

$$\begin{aligned} \langle \text {buy}, h, p_d^{h,i}, q_0 \rangle , \quad i=1,\dots ,n^h_d, \quad p_d^{h,i}\ge p_d^{h,i+1}, \end{aligned}$$
(3)

where \(n^h_d\) is the number of bids, and \(p_d^{h,i}\) are unit prices.

4.3 Parametric Representation of a Collection of Bids

In order to apply reinforcement learning to learn to designate collections of bids in the form (1) and (3), we need a way to translate vectors of predefined dimension into bid collections of variable size. We design this translation as follows. Let the action space be 100-dimensional, \(a\in [-1,1]^{100}\). Coordinates of a single action define all bids for the whole day. The collection of sell bids for the hour \(h=0,\dots ,23\) is given by (1) with

$$\begin{aligned} n^h_s & = \lfloor c_q \exp (c_e a_{h}) / q_0 + 1/2\rfloor \end{aligned}$$
(4)
$$\begin{aligned} p^{h,i}_s & = c^h_p \exp (a_{h+24}) \bigg (1 + \exp (a_{96}) \left( -(2a_{98}+4)^{-1}+(i/n^h_s)^{2a_{98}+3}\right) \bigg ) \end{aligned}$$
(5)

The collection of buy bids for the hour \(h=0,\dots ,23\) is given by (3) with

$$\begin{aligned} n^h_d & = \lfloor c_q \exp (c_e a_{h+48}) / q_0 + 1/2\rfloor \end{aligned}$$
(6)
$$\begin{aligned} p^{h,i}_d & = c^h_p \exp (a_{h+72}) \bigg (1 + \exp (a_{97}) \left( (2a_{99}+4)^{-1}-(i/n^h_d)^{2a_{99}+3}\right) \bigg ) \end{aligned}$$
(7)

where \(a_k\) denotes k-th coordinate of the action a, and

  • \(a_h\)/\(a_{h+48}\) defines the width of the supply/demand curve, i.e., the number of sell/buy bids for the hour h,

  • \(a_{h+24}\)/\(a_{h+72}\) defines the average height at which the supply/demand curve is located,

  • \(a_{h+24} + a_{96}\)/\(a_{h+72}+a_{97}\) defines the vertical span of the supply/demand curve,

  • \(a_{98}\)/\(a_{99}\) defines convexity/concavity of the supply/demand curve,

  • \(c_q\)—quantity scaling factor (we assume its value equal to the maximum hourly production of the installed sources),

  • \(c^h_p\)—price scaling factor (we assume its value equal to the median price for hour h over the last 28 days),

  • \(c_e\)—quantity exponent scaling factor (we assume \(c_e=3\)).

The resulting supply and demand curves are depicted in Fig. 2. Note that the above symbols, except \(q_0,c_q,c_e\), depend on t, but we skip this dependence in the notation.
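The translation (4)–(7) is straightforward to implement. Below is a minimal Python sketch under the assumption that prices are returned as plain lists and every listed bid carries the quantity \(q_0\); function and variable names are illustrative rather than part of the deployed system.

```python
import numpy as np

def action_to_bids(a, q0, c_q, c_p, c_e=3.0):
    """Translate an action a in [-1, 1]^100 into bid collections via eqs. (4)-(7).

    Returns a dict hour -> (sell_prices, buy_prices); every listed bid carries
    the constant quantity q0. c_p is a length-24 array of per-hour price
    scaling factors (e.g., 28-day median prices).
    """
    bids = {}
    for h in range(24):
        n_s = int(np.floor(c_q * np.exp(c_e * a[h]) / q0 + 0.5))       # eq. (4)
        n_d = int(np.floor(c_q * np.exp(c_e * a[h + 48]) / q0 + 0.5))  # eq. (6)
        e_s, e_d = 2 * a[98] + 3, 2 * a[99] + 3                        # exponents in (5), (7)
        sell = [c_p[h] * np.exp(a[h + 24]) *
                (1 + np.exp(a[96]) * (-1 / (e_s + 1) + (i / n_s) ** e_s))
                for i in range(1, n_s + 1)]                            # eq. (5)
        buy = [c_p[h] * np.exp(a[h + 72]) *
               (1 + np.exp(a[97]) * (1 / (e_d + 1) - (i / n_d) ** e_d))
               for i in range(1, n_d + 1)]                             # eq. (7)
        bids[h] = (sell, buy)
    return bids

rng = np.random.default_rng(0)
example = action_to_bids(rng.uniform(-1, 1, 100), q0=1.0, c_q=10.0,
                         c_p=np.full(24, 0.5))
```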

The supply and demand curves above are designed symmetrically. Thus, let us only analyze \(p^{h,i}_s\) (5). The term

$$\begin{aligned} -(2a_{98}+4)^{-1}+(i/n^h_s)^{2a_{98}+3} \end{aligned}$$
(8)

makes the supply curve an increasing power function, with the exponent \(2a_{98}+3\) controlling the convexity/concavity of the curve; for \(a_{98}\in [-1,1]\) the exponent is in the interval [1, 5]. The component \(-(2a_{98}+4)^{-1}\) makes the average of (8) over \(i\in [0,n^h_s]\) equal to zero. The term \(\exp (a_{96})\) controls the vertical span of the supply curve. The values of \(a_{96}\) and \(a_{98}\) do not impact the average height at which the supply curve is located, which is determined only by the term \(c^h_p\exp (a_{h+24})\).
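A quick numerical check of this centering claim (a sketch with \(a_{98}\) fixed to an arbitrary value and the discrete index approximated by a continuous one):

```python
import numpy as np

a_98 = 0.3                              # any value in [-1, 1]
e = 2 * a_98 + 3                        # exponent of the power curve
i = np.linspace(0.0, 1.0, 200_001)      # i / n^h_s on a fine grid
term = -1.0 / (e + 1) + i ** e          # expression (8)
print(np.isclose(term.mean(), 0.0, atol=1e-4))  # True: the offset centers (8)
```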

The widths and vertical locations of the curves are specified separately for different hours by their corresponding action coordinates. However, the vertical span of these curves and their convexity/concavity are specified for all hours by the same action coordinates \(a_{96}\dots a_{99}\). This parameter sharing is intended to maintain a low enough dimensionality of the action space.

Fig. 2. Supply and demand defined by our proposed collections of bids.

4.4 Bidding Policy

In reinforcement learning, a policy, \(\pi \), is generally a probability distribution of actions conditioned on states:

$$\begin{aligned} a_t \sim \pi (\cdot |s_t), \end{aligned}$$
(9)

where \(s_t\) and \(a_t\) are, respectively, the state and the action at the instant t of discrete time. We adopt a policy in the form

$$\begin{aligned} a_t = g^1(s_t; \theta ) + \xi _t\circ \exp (g^2(s_t; \theta )), \;\; \xi _t \sim \mathcal {N}(0, I), \end{aligned}$$
(10)

where \(g^1\) and \(g^2\) are two vectors produced by a neural network g, which is fed with the state \(s_t\) and parameterized by a vector \(\theta \) of trained weights; “\(\circ \)” denotes the Hadamard (elementwise) product, and \(\xi _t\) is standard normal noise.
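For illustration, a minimal PyTorch sketch of such a policy network; the hidden sizes, the shared trunk, and the clipping of the sampled action to \([-1,1]\) are our assumptions, not details fixed by the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy (10): a_t = g1(s_t) + xi * exp(g2(s_t)), xi ~ N(0, I)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, action_dim)     # g^1(s_t; theta)
        self.log_std_head = nn.Linear(hidden, action_dim)  # g^2(s_t; theta)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        z = self.trunk(state)
        mean, log_std = self.mean_head(z), self.log_std_head(z)
        action = mean + torch.randn_like(mean) * torch.exp(log_std)  # eq. (10)
        return torch.clamp(action, -1.0, 1.0)                        # keep a in [-1, 1]

policy = GaussianPolicy(state_dim=117, action_dim=100)
action = policy(torch.zeros(117))  # one sampled 100-dimensional action
```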

4.5 Bidding Policy Optimization with Reinforcement Learning

Participation in the day-ahead market can be naturally modeled as a Markov Decision Process in which the state, \(s_t\), of the environment at time \(t=1,2,\dots \) is a vector composed of two sub-vectors: uncontrollable variables \(s^u_t\) and controllable variables \(s^c_t\). The uncontrollable state variables denote external conditions like weather forecasts. They evolve according to an unknown stationary conditional probability

$$\begin{aligned} s^u_{t+1} \sim P(\cdot | s^u_t). \end{aligned}$$
(11)

The controllable variables \(s^c_t\) are directly determined by the actions \(a_t\) taken and the uncontrollable state coordinates, that is,

$$\begin{aligned} s^c_{t+1} = f(s^c_t, a_t, s^u_t, s^u_{t+1}), \end{aligned}$$
(12)

where f is known. The key controllable state variable is the power storage charge. It trivially results from the agent’s bids (actions) and uncontrollable variables: market prices and the agent’s own energy production and consumption.

The critical assumption that allows us to distinguish uncontrollable and controllable variables is that the trading agent is small enough not to impact the market prices. Therefore, we may simulate its bidding and determine whether the bids are executed based on the recorded market prices. If the agent were large enough to actually impact the market prices, this simulation would not be realistic, at least not without an elaborate model of the agent’s impact on the market prices.

Note that the above-defined division of state variables into controllable and uncontrollable is unusual. In a typical MDP, we assume that the state changes according to

$$\begin{aligned} s_{t+1} \sim P_s(\cdot | s_t, a_t), \end{aligned}$$
(13)

where the conditional probability \(P_s\) may be quite difficult to analyze and estimate. Therefore, a strategy of choosing actions cannot be evaluated without bias within a simulation based on a model of \(P_s\).

Based on a recorded trajectory of uncontrollable states, \((s^u_t: t=1,\dots ,T)\), we can designate a strategy of selecting actions \(a_t\) based on states \(s_t\) and evaluate this strategy in a simulation with the record \((s^u_t: t=1,\dots ,T)\) replayed. This evaluation is an unbiased estimate of the performance of this strategy deployed in reality. Furthermore, we can replay this record repeatedly and simulate episodes of on-line RL, using f (12) to designate consecutive values of \(s^c_t\).
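The replay-based simulation can be sketched as follows; all callables are hypothetical placeholders for the bidding policy, the storage update f, and the daily financial result.

```python
def rollout(policy, u_trajectory, s_c_init, f, reward_fn):
    """Simulate one episode by replaying recorded uncontrollable states s^u_t
    and propagating controllable states with the known transition f (12)."""
    s_c, total_reward = s_c_init, 0.0
    for t in range(len(u_trajectory) - 1):
        s_u, s_u_next = u_trajectory[t], u_trajectory[t + 1]
        a = policy((s_u, s_c))                    # state = (uncontrollable, controllable)
        total_reward += reward_fn(s_c, a, s_u, s_u_next)
        s_c = f(s_c, a, s_u, s_u_next)            # eq. (12): e.g., new storage charge
    return total_reward
```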

In order to optimize the strategy (10), we may use any on-line reinforcement learning algorithm [33], e.g., A2C [22], PPO [31], or SAC [11]. In the experiments below, we used the A2C algorithm, which showed the best stability by far. Our comparison of RL algorithms is presented in Appendix G of the supplementary material. Training consists of a sequence of simulated trials in which the trajectory of uncontrollable states is replayed from the data, and the corresponding trajectory of controllable states is designated based on the uncontrollable states, the selected actions, and the function f (12).
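A sketch of such a training setup with Stable-Baselines3 is shown below. The environment is a bare stub with placeholder dynamics and reward, standing in for the simulator of Sect. 5.1, and the hyperparameters are illustrative rather than the ones used in our experiments.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import A2C

class DayAheadMarketStub(gym.Env):
    """Minimal stand-in for the day-ahead market simulator of Sect. 5.1.

    Observation: 117 values (Sect. 5.2); action: a in [-1, 1]^100 (Sect. 4.3).
    Dynamics and reward are placeholders so that the snippet runs; the real
    environment replays recorded prices/weather and applies f (12).
    """
    def __init__(self, episode_days: int = 90):
        super().__init__()
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(117,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(100,), dtype=np.float32)
        self.episode_days = episode_days
        self._day = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._day = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._day += 1
        obs = self.observation_space.sample()
        reward = 0.0                                  # placeholder for eq. (18)
        terminated = self._day >= self.episode_days
        return obs, reward, terminated, False, {}

env = DayAheadMarketStub()
model = A2C("MlpPolicy", env, learning_rate=7e-4, gamma=0.99, verbose=0)
model.learn(total_timesteps=10_000)                   # the real budget is much larger
```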

4.6 Alternative Bidding Strategies

In order to verify our proposed bidding strategy, we compare it to two more intuitive ones.

Simple Arbitrage Strategy. Perhaps the simplest conceivable bidding strategy is to buy energy when it is cheap, keep it in the battery, and sell it when it is expensive. On most days, the market value of electricity is the lowest at 2 am, and it is the highest at 10 am. Therefore, our reference simple arbitrage strategy assumes placing the two bids:

$$\begin{aligned} \langle \text {buy}, 2 am, +\infty , \theta _1-\widehat{l} \rangle , \quad \langle \text {sell}, 10 am, -\infty , \theta _2 \rangle , \end{aligned}$$
(14)

where \(\widehat{l}\) is an estimated storage state of charge at 0 am, and \(\theta _1\), \(\theta _2\) are optimized parameters. We apply the CMA-ES evolutionary algorithm [12] for their optimization.
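The CMA-ES optimization of \(\theta _1\), \(\theta _2\) can be sketched with the pycma package; the objective below is a placeholder, whereas in our setup it is minus the profit obtained from the replayed simulation of Sect. 5.1.

```python
import cma  # pycma, an implementation of CMA-ES [12]

def negative_profit(theta):
    """Placeholder objective: simulate the arbitrage strategy (14) with
    parameters theta = (theta_1, theta_2) and return minus the total profit.
    The constant below only keeps the sketch runnable."""
    theta_1, theta_2 = theta
    return 0.0

es = cma.CMAEvolutionStrategy(x0=[0.0, 0.0], sigma0=1.0)
for _ in range(100):                      # bounded number of generations
    if es.stop():
        break
    candidates = es.ask()
    es.tell(candidates, [negative_profit(x) for x in candidates])
theta_best = es.result.xbest              # optimized (theta_1, theta_2)
```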

Pair of Bids Strategy. A simple approach to bidding on the day-ahead electricity market, which also involves reinforcement learning, is to present just two bids for each hour \(h=0am, \dots , 11pm\), namely

$$\begin{aligned} \langle \text {buy}, h, p^h_d, n^h_d q_0 \rangle , \quad \langle \text {sell}, h, p^h_s, n^h_s q_0 \rangle , \end{aligned}$$
(15)

where \(p^h_d\), \(n^h_d\), \(p^h_s\) and \(n^h_s\) are defined by an action, \(a\in [-1,1]^{96}\), as follows:

$$\begin{aligned} n^h_d & = \lfloor c_q \exp (c_e a_{h+48}) / q_0 + 1/2\rfloor , & p^{h}_d = c^h_p \exp (a_{h+72}), \end{aligned}$$
(16)
$$\begin{aligned} n^h_s & = \lfloor c_q \exp (c_e a_{h}) / q_0 + 1/2\rfloor , & p^{h}_s = c^h_p \exp (a_{h+24}). \end{aligned}$$
(17)

For comparison, see \(n^h_d\) (6), \(p^{h,i}_d\) (7), \(n^h_s\) (4), \(p^{h,i}_s\) (5). The collection of bids strategy introduced in Sect. 4.2 would be equivalent to (16) and (17), if all buy bids for a given hour had equal price and all sell bids for a given hour had equal price. In our simulations, we use the same reinforcement learning setup to train strategies that place the above pairs of bids and the collections of bids introduced in Sect. 4.3.

5 Simulations

5.1 Simulation Environment

Experiments are conducted using a custom environment simulating day-ahead energy market operations. The simulator is based on real-life data from the Polish market. It allows for the customization of various market settings, such as the bid creation time, the scale of the trading agent (defined by the number of households), or its solar and wind energy generation capabilities. The environment implements the Gymnasium interface [34], making it compatible with popular reinforcement learning libraries, including Stable-Baselines3 [29], which we use as our source of RL algorithms.

We provide details and parameters on the simulation environment, the trading agent’s energy consumption and production profile, and weather forecast randomization in Appendices B–E of the supplementary material.

We run our experiments by replaying the events that occurred in the years 2016–2019. We selected this period because it precedes the COVID-19 pandemic, which destabilized markets. The runs replay the original price and weather data. To diversify each replay and thus avoid overfitting to the data, we randomize weather forecasts and electricity demand according to their statistical profiles.

During the simulation, the trading agent may be forced to buy missing energy or sell excess energy immediately. This happens when the agent sells or uses energy it does not have, or buys energy it has no room for. The agent is penalized for such events: immediate buying is settled at double the current market price and immediate selling at half of it, so the agent has an incentive to plan its bids rather than rely on instant buys or sells. We do not include market entry and transaction fees, as they are fixed costs independent of the bidding strategy.
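A minimal sketch of this penalized imbalance settlement (the function name and return convention are ours):

```python
def settle_imbalance(net_kwh, charge, capacity, market_price):
    """Store a net surplus (net_kwh > 0) or cover a deficit (net_kwh < 0).

    Energy that cannot be stored or drawn is traded immediately: missing
    energy is bought at twice the market price, excess is sold at half of it.
    Returns the new storage charge and the resulting cash flow.
    """
    new_charge = charge + net_kwh
    cash_flow = 0.0
    if new_charge < 0.0:                     # storage empty: forced buy
        cash_flow -= (-new_charge) * 2.0 * market_price
        new_charge = 0.0
    elif new_charge > capacity:              # storage full: forced sell
        cash_flow += (new_charge - capacity) * 0.5 * market_price
        new_charge = capacity
    return new_charge, cash_flow
```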

5.2 Experiments

Reinforcement learning is used to optimize the bidding policy for a collection of bids parameterized as in Sect. 4.3, later referred to as COLLECTION. We use data from 2016 to the third quarter of 2018 as the training set, data from the fourth quarter of 2018 as the validation set, and data from 2019 as the testing set. Training is performed on randomly selected 90-day intervals from the training set. Periodically, the policy is evaluated on a single 90-day validation interval. After the training timestep budget is depleted, the model that achieved the highest reward on the validation interval is evaluated on a single 365-day testing interval. Common parameters used for the RL experiments are available in Table 2 of the supplementary material.

The observation of the environment’s state (117 values) is passed to the agent at bid placing time and contains the following information (a sketch of assembling this vector follows the list):

  • energy prices of the current day for every hour (24 values) – these are the prices for the current day, determined by the bids created the day before; the agent does not know the prices for the bids it is currently submitting,

  • current relative battery charge (1 value),

  • estimated relative battery charge at midnight (1 value),

  • one-hot encoded information about the current month (12 values),

  • one-hot encoded information about the current day of the week (7 values),

  • cloudiness, wind speed, and temperature forecasts for each hour of the next day (72 values).
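A sketch of assembling this observation vector; function and argument names are illustrative.

```python
import numpy as np

def build_observation(day_prices, charge, charge_midnight_est,
                      month, weekday, weather_forecast):
    """Assemble the 117-dimensional state vector of Sect. 5.2.

    day_prices: 24 known prices of the current day; charge and
    charge_midnight_est: relative battery charges in [0, 1]; month: 1-12;
    weekday: 0-6; weather_forecast: 72 values (cloudiness, wind speed, and
    temperature for each hour of the next day).
    """
    month_onehot = np.eye(12)[month - 1]
    weekday_onehot = np.eye(7)[weekday]
    obs = np.concatenate([np.asarray(day_prices, dtype=np.float32),
                          [charge, charge_midnight_est],
                          month_onehot, weekday_onehot,
                          np.asarray(weather_forecast, dtype=np.float32)])
    assert obs.shape == (117,)
    return obs
```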

Rewards are computed as

$$\begin{aligned} r_t = 10^{-3}\left( p_t - \bar{p}_t - \rho _t\right) , \end{aligned}$$
(18)

where \(p_t\) is the daily profit from selling and buying energy, \(\bar{p}_t\) is a reference profit, and \(\rho _t\) is a regularizing penalty. The reference profit \(\bar{p}_t\) is a daily profit that would be achieved if the difference between daily produced and consumed energy was sold or bought at the average market price from that day. The reference profit is not trivial to achieve since the agent mostly consumes energy when it is expensive and produces energy when it is cheap. The regularizing penalty

$$\begin{aligned} \rho _t = \sum _{i=1}^{\dim (a_{t})}[|a_{t,i}|>0.99] \end{aligned}$$
(19)

where [condition] equals 1 if the condition is true and 0 otherwise, prevents the action coordinates from saturating at their bounds. The effect of regularization on the performance of the tested strategies is presented and discussed in Appendix I of the supplementary material.
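Put together, the daily reward computation is a one-liner (a sketch with illustrative names):

```python
import numpy as np

def reward(daily_profit, reference_profit, action, threshold=0.99):
    """Reward (18) with the saturation penalty (19)."""
    penalty = np.sum(np.abs(action) > threshold)          # eq. (19)
    return 1e-3 * (daily_profit - reference_profit - penalty)
```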

We compare the collection of bids strategy to the alternative strategies presented in Sect. 4.6. The simple arbitrage strategy is later referred to as ARBITRAGE, and the pair of bids strategy is later referred to as PAIR.

We also applied the algorithm from [9], later referred to as FARL, which is a conceptually different approach to optimizing a bidding strategy. FARL considers each day a 24-step episode and places a single sell/buy bid at each hour. FARL is based on the assumption that each bid is placed when the market prices for the preceding hours are already known. This assumption does not hold for any day-ahead electricity market we are aware of. We used this algorithm to produce bids for consecutive hours without access to the market prices of previous biddings, feeding it the same training, validation, and test data as discussed above. However, when used this way, it was unable to produce even a remotely reasonable strategy. Implementation details, parameters, and a discussion of the FARL algorithm are provided in Appendix F of the supplementary material.

5.3 Different Operation Scenarios

We tested the proposed collection of bids strategy in comparison to the alternatives from Sect. 4.6 in the following scenarios:

  • an agent has an energy storage only (BES),

  • an agent has an energy storage and production capabilities (BES+PROD),

  • an agent has an energy storage and consumes energy (BES+CON),

  • an agent has an energy storage, produces and consumes electricity (ALL).

5.4 Results

Table 1. Differences between achieved balances and the reference profit for the tested strategies in different scenarios; last column contains the reference.

Table 1 presents the differences between the total profits achieved by the tested strategies and the total reference profit described above; the last column contains the reference itself. Depending on the scenario, the reference varies considerably, because the trading agent either sells the energy produced, buys the energy consumed, does both, or does neither. The proposed collection of bids strategy achieved the best profits, beating the pair of bids strategy in all tested scenarios. The pair of bids strategy achieved reasonable results, but slightly worse than the proposed strategy.

Of all tested scenarios, the collection of bids strategy achieved the largest advantage over the pair of bids strategy in the battery-only scenario. Here, the agent earns money solely from the bids it creates, without any production or consumption to include in them. Notably, the collection of bids strategy adapts to these circumstances, making the biggest buys when the energy price is low and the biggest sells when it is high, with additional smaller transactions also happening in beneficial hours. This means that the collection of bids strategy recognizes significant price fluctuations, allowing it to capitalize on occasional price swings.

In all of the tested scenarios, both strategies were able to adapt to the circumstances, buying enough energy when only consumption was active and selling surpluses of energy when only production was active. Immediate transactions due to lack or excess of energy were, in fact, very rare.

Fig. 3. Mean hourly relative battery charge. Strategy: COLLECTION. Scenario: ALL.

In Fig. 3, the mean hourly relative battery charge is presented. It was calculated for the COLLECTION strategy based on the test run that achieved the best profit. The proposed strategy makes good use of the available storage capacity, with smooth transitions between hours, indicative of reasonable bid creation. The battery is charged at night, which means that the agent buys energy when it is cheap, and it is discharged at about 10 am, which means that the agent sells energy when it is the most expensive. The PAIR strategy is generally able to leverage this regularity and achieve reasonable profits. However, our proposed COLLECTION strategy is also able to leverage unpredictable variations of prices to the agent’s benefit: it buys more when the prices are unexpectedly low and sells more when they are unexpectedly high.

The supplementary material attached to this paper contains the following:

  • Appendix A - additional related works

  • Appendix B - details and parameters of the simulation environment

  • Appendix C - model of the trading agent’s energy consumption

  • Appendix D - model of the trading agent’s energy production

  • Appendix E - model for creating weather forecasts from real weather data

  • Appendix F - description of adapting the FARL algorithm [9] to our simulation environment

  • Appendix G - comparison of other RL algorithms (PPO, SAC, TD3) together with their hyperparameters

  • Appendix H - detailed results for the pair of bids strategy

  • Appendix I - study of using regularization in the collection of bids and the pair of bids strategies

  • Plots for the collection of bids and the pair of bids strategies with different scenarios.

6 Conclusions

In this paper, we have proposed a parametrization of supply and demand curves, which allows for multiple sell and buy bids at each time, thus introducing increased flexibility and efficiency to automated trading on electronic markets. We have described a framework for optimization of this parametrized bidding strategy on a day-ahead energy market based on simulations and real-life data. We have used reinforcement learning to optimize this strategy and have compared it with different strategies. The proposed collection of bids strategy achieved the best results, getting the highest financial profit while showing reasonable behavior with battery management and bid placement.

The proposed strategy’s generality and adaptability to data allow it to be deployed in real life. Indeed, the strategy is now being deployed in a system for energy storage management.