TB7 - Reinforcement Learning Methods and Applications
May 12, 2026, 15:30 – 17:10
Location: Budapest (green)
Chaired by Erick Delage
4 Presentations
Learning Fair Policies in Major-Minor Weakly Coupled MDPs: Application to Joint Pricing and Relocation in Ride-Hailing
We study fair resource allocation in weakly coupled Markov decision processes, applied to joint taxi pricing and relocation. Resource constraints link a major MDP with N minor MDPs. Using a generalized ρ-fairness criterion in place of the usual total-sum objective, we show that the problem reduces to a permutation-invariant optimization, enabling efficient algorithms. Results are validated on synthetic data and real-world NYC taxi data.
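A minimal sketch of the kind of fairness aggregator the abstract contrasts with total-sum objectives, using the standard α-fair family parameterized here by ρ; the talk's generalized ρ-fairness criterion may differ, and `rho_fair_welfare` is a hypothetical name:

```python
import numpy as np

def rho_fair_welfare(returns, rho):
    """Aggregate per-agent returns with a rho-parameterized fairness
    criterion (standard alpha-fairness form; assumes positive returns).

    rho = 0 recovers the plain total-sum objective; rho = 1 gives
    proportional (log) fairness; larger rho approaches max-min fairness.
    The value depends only on the multiset of returns, so it is
    permutation-invariant, the property the abstract's reduction exploits.
    """
    returns = np.asarray(returns, dtype=float)
    if rho == 1.0:
        return np.sum(np.log(returns))
    return np.sum(returns ** (1.0 - rho)) / (1.0 - rho)

print(rho_fair_welfare([1.0, 2.0, 4.0], rho=0.0))  # 7.0, the plain sum
print(rho_fair_welfare([1.0, 2.0, 4.0], rho=2.0))  # penalizes inequality
```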
Risk-Aware Decision Making in Restless Bandits: Theory and Algorithms for Planning and Learning
In restless bandits, a central agent must optimally distribute limited resources across several bandits (arms), each of which is a Markov decision process. In this work, we generalize the traditional risk-neutral restless bandits problem by incorporating risk-awareness, which is particularly important in real-world applications where the decision maker seeks to mitigate downside risk. We establish indexability conditions under a risk-aware objective and provide, for the first time, a Whittle-index-based solution to the planning problem for both finite-horizon non-stationary and infinite-horizon stationary Markov decision processes. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach, and we show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure is illustrated through numerical experiments on machine replacement and patient scheduling applications, under both planning and learning setups.
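Assuming indexability, the Whittle heuristic itself is simple: compute one index per arm and activate the arms with the largest indices; a risk-aware objective changes how the indices are computed, not how they are used. A minimal sketch (the function name and example values are illustrative):

```python
import numpy as np

def whittle_allocation(indices, budget):
    """Activate the `budget` arms with the largest Whittle indices.

    Under indexability, this greedy rule is the standard Whittle
    heuristic; a risk-aware objective changes how each arm's index
    is computed upstream, not this allocation step.
    """
    indices = np.asarray(indices, dtype=float)
    active = np.zeros(indices.size, dtype=bool)
    active[np.argsort(indices)[-budget:]] = True
    return active

# Example: 5 arms, a budget of 2 activations per decision epoch.
print(whittle_allocation([0.3, -0.1, 0.9, 0.2, 0.5], budget=2))
# -> [False False  True False  True]
```

For the learning setup, a Thompson-sampling loop would additionally maintain a posterior (e.g., Dirichlet) over each arm's unknown transition probabilities, sample a model at the start of each episode, and compute the indices in the sampled model.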
Online Policy Evaluation for MDPs with Dynamic UBSR Measures
Developing efficient policy evaluation methods for risk-aware reinforcement learning remains a challenging task. In this work, we propose computationally efficient algorithms for policy evaluation in MDPs with dynamic utility-based shortfall risk (UBSR) measures using linear function approximation. Specifically, we introduce a first-order UBSR-TD algorithm and a second-order UBSR-Newton algorithm for online policy evaluation, extending classical policy evaluation methods to accommodate general loss functions. We establish almost-sure convergence of the UBSR-TD algorithm and provide a high-probability convergence bound for the UBSR-Newton algorithm. Our results demonstrate that incorporating an appropriate loss function into the updates of standard temporal-difference and Newton-type schemes naturally yields policy evaluation algorithms for MDPs with dynamic UBSR measures. Numerical experiments support the theoretical findings, and an application to a perishable inventory management problem highlights the practical effectiveness of the proposed methods.
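To illustrate the stated idea of folding a loss function into a temporal-difference update, here is a rough first-order sketch with linear function approximation; it is not the authors' UBSR-TD algorithm, and `ubsr_td_step` and the exponential loss derivative are illustrative assumptions:

```python
import numpy as np

def ubsr_td_step(theta, phi, phi_next, reward, gamma, step, loss_grad):
    """One step of a hypothetical loss-shaped TD(0) update with linear
    function approximation (value estimate = phi @ theta).

    Plain TD(0) corresponds to loss_grad(d) = d; passing the temporal
    difference through the derivative of a convex loss is the sense in
    which such schemes generalize classical policy evaluation.
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta
    return theta + step * loss_grad(delta) * phi

# Example with an exponential-type loss derivative (illustrative choice).
theta = np.zeros(3)
theta = ubsr_td_step(theta, np.array([1.0, 0.0, 1.0]),
                     np.array([0.0, 1.0, 0.0]), reward=1.0,
                     gamma=0.9, step=0.1,
                     loss_grad=lambda d: np.exp(0.5 * d) - 1.0)
print(theta)
```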
Risk-Averse Optimal Execution: A Comparison of VaR and CVaR Criteria
We consider a trader who wants to liquidate a large volume of an asset over a finite horizon while adapting decisions to the level of risk and market fluctuations, aiming to achieve better performance than standard benchmarks.
The work is structured as follows: first, we introduce the essential concepts; then we detail the problem and the methodology; and finally we present the numerical results and discuss the limitations.
The methodology is based on Markov decision processes (MDPs), which allow all possible scenarios to be evaluated in order to determine the optimal decision at each moment. A market impact model is built from properties of the limit order book, and the price dynamics are approximated by a binomial model. Risk measures such as Value at Risk (VaR) and Conditional Value at Risk (CVaR) are then applied to the total return, modeled as a random variable, and used as optimization criteria. The results show that the VaR-based model follows a more opportunistic strategy, while the CVaR-based model favors a slower, more cautious liquidation. Both models adapt to the initial risk level throughout the execution, and compared to benchmark strategies, both demonstrate higher efficiency.
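For reference, the two criteria compared in the talk can be estimated empirically from simulated total returns; a minimal sketch (the function name, confidence level, and the simulated schedules are illustrative, not the talk's model):

```python
import numpy as np

def var_cvar(total_returns, beta=0.95):
    """Empirical VaR and CVaR of a liquidation's total return.

    Working on losses (negated returns): VaR is the beta-quantile of
    the loss distribution, CVaR the average loss at or beyond it, so
    CVaR >= VaR and penalizes the tail more heavily.
    """
    losses = -np.asarray(total_returns, dtype=float)
    var = np.quantile(losses, beta)
    cvar = losses[losses >= var].mean()
    return var, cvar

# Example on simulated terminal returns of two liquidation schedules.
rng = np.random.default_rng(1)
fast = rng.normal(loc=1.0, scale=0.5, size=10_000)   # opportunistic
slow = rng.normal(loc=0.8, scale=0.2, size=10_000)   # cautious
print(var_cvar(fast), var_cvar(slow))
```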
