Abstract
Markov decision processes (MDPs) and their variants are widely studied in the theory of control for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. Some literature on MDPs does take risk into account; much of it addresses the exponential utility (EU) function and mechanisms that penalize various forms of the variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean reward; variability above the mean is usually beneficial and should not be penalized or avoided. Hence, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risk alone, however, can lead to poor solutions in which the risk is zero or near zero but the average reward is also rather low. In this paper, therefore, we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing this risk outright, we only reduce it at the cost of slightly lowered average rewards. A solution in which the risk is low and the average reward is high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite-horizon MDP, the finite-horizon MDP, and the infinite-horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizons. The algorithms are tested in numerical studies and show encouraging performance.
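To illustrate the criterion described in the abstract (not the paper's own algorithms), the sketch below estimates the downside-risk-adjusted objective E[R] − θ·P(R < τ) by Monte Carlo for two fixed policies on a toy two-state MDP. All numbers here, the rewards, the transition rule, the risk coefficient θ, and the target τ, are assumptions chosen only to show how a positive θ can flip which policy is preferred.

```python
import random

def simulate_return(policy, horizon=10, rng=random):
    """Roll out one episode and return the cumulative reward R.

    policy maps a state (0 or 1) to an action ('safe' or 'risky').
    The 'safe' action earns 1 deterministically; the 'risky' action
    earns 4 with probability 0.6 and -2 otherwise. (Illustrative
    numbers only, not from the paper.)
    """
    state, total = 0, 0.0
    for _ in range(horizon):
        if policy[state] == 'safe':
            reward = 1.0
        else:
            reward = 4.0 if rng.random() < 0.6 else -2.0
        total += reward
        state = 1 - state  # deterministic transition, kept trivial
    return total

def downside_adjusted_value(policy, theta, tau, episodes=20000, seed=0):
    """Monte Carlo estimate of E[R] - theta * P(R < tau)."""
    rng = random.Random(seed)
    returns = [simulate_return(policy, rng=rng) for _ in range(episodes)]
    mean_reward = sum(returns) / episodes
    downside_risk = sum(r < tau for r in returns) / episodes
    return mean_reward - theta * downside_risk

safe_policy = {0: 'safe', 1: 'safe'}
risky_policy = {0: 'risky', 1: 'risky'}

# theta = 0 recovers the risk-neutral criterion, under which the risky
# policy wins (higher mean reward); a sufficiently large theta penalizes
# the probability of falling below the target tau = 8 enough to make the
# safe policy preferable instead.
for theta in (0.0, 50.0):
    v_safe = downside_adjusted_value(safe_policy, theta, tau=8.0)
    v_risky = downside_adjusted_value(risky_policy, theta, tau=8.0)
    print(theta, round(v_safe, 2), round(v_risky, 2))
```

The safe policy always earns exactly 10 over the horizon, so its downside risk relative to τ = 8 is zero and its adjusted value is unaffected by θ; the risky policy has a higher expected return but a nonzero probability of ending below the target, which the scalar θ trades off against the mean.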
Additional information
Recommended by Guest Editor Dong-Ling Xu
Abhijit Gosavi received the B.Eng. degree in mechanical engineering from Jadavpur University, India in 1992, and the M.Eng. degree in mechanical engineering from the Indian Institute of Technology, Madras, India in 1995. He received the Ph.D. degree in industrial engineering from the University of South Florida in 1999. Currently, he is an associate professor in the Department of Engineering Management and Systems Engineering at Missouri University of Science and Technology. He has published more than 60 refereed journal and conference papers. His research interests include simulation-based optimization, Markov decision processes, productive maintenance, and revenue management. He has received research funding awards from the National Science Foundation of the United States of America and numerous other agencies. He is a member of IIE, ASEM, and INFORMS.
ORCID ID: 0000-0002-9703-4076
Anish Parulekar received the B.Eng. degree in mechanical engineering from the University of Mumbai, India in 2004, and the M.Sc. degree in industrial engineering from the Department of Industrial and Systems Engineering at the University at Buffalo, The State University of New York, USA in 2006. He currently serves as a deputy vice president and head of Marketing Analytics at Axis Bank, Mumbai, India.
His research interests include risk, computing, and Markov control.
About this article
Cite this article
Gosavi, A., Parulekar, A. Solving Markov decision processes with downside risk adjustment. Int. J. Autom. Comput. 13, 235–245 (2016). https://doi.org/10.1007/s11633-016-1005-3