Selected Publications
Beyond Prediction: Managing the Repercussions of Machine Learning Applications
Aline Weber*,
Blossom Metevier*,
Yuriy Brun,
Philip S. Thomas,
Bruno Castro da Silva
*Equal contribution
Advances in Neural Information Processing Systems (NeurIPS 2025)
Abstract
Machine learning models are often designed to maximize a primary goal, such as accuracy. However, as these models are
increasingly used to inform decisions that affect people's lives or well-being, it is often unclear what the real-world
repercussions of their deployment might be—making it crucial to understand and manage such repercussions effectively.
For example, models maximizing user engagement on social media platforms may inadvertently contribute to the spread of misinformation
and content that deepens political polarization. This issue is not limited to social media—it extends to other applications
where machine learning-informed decisions can have real-world repercussions, such as education, employment, and lending.
Existing methods addressing this issue require prior knowledge or estimates of analytical models describing the relationship
between a classifier's predictions and their corresponding repercussions. We introduce Theia, a novel classification
algorithm capable of optimizing a primary objective, such as accuracy, while providing high-confidence guarantees about its
potential repercussions. Importantly, Theia solves the open problem of providing such guarantees based solely on existing
data with observations of previous repercussions. We prove that it satisfies constraints on a model's repercussions with
high confidence and that it is guaranteed to identify a solution, if one exists, given sufficient data. We empirically
demonstrate, using real-life data, that Theia can identify models that achieve high accuracy while ensuring, with high
confidence, that constraints on their repercussions are satisfied.
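The style of guarantee described above can be pictured with a small, generic sketch (assumptions mine; this is not Theia's actual procedure): a high-confidence safety test that accepts a model only if a concentration bound on its observed repercussions stays below a user-chosen limit.

```python
# Illustrative sketch only: a Hoeffding-style safety test of the kind of
# high-confidence guarantee described in the abstract. Theia's actual
# algorithm, objective, and bounds are more involved.
import numpy as np

def repercussion_upper_bound(observations, delta, value_range=1.0):
    """One-sided Hoeffding upper bound on the mean repercussion, holding
    with probability at least 1 - delta for i.i.d. data in [0, value_range]."""
    n = len(observations)
    return np.mean(observations) + value_range * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def passes_safety_test(observations, threshold, delta=0.05):
    """Accept a model only if its expected repercussion is certified, with
    confidence 1 - delta, to be at most `threshold`."""
    return repercussion_upper_bound(observations, delta) <= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observed = rng.uniform(0.0, 0.3, size=500)   # made-up repercussion data
    print(passes_safety_test(observed, threshold=0.4))
```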
Fair Continuous Resource Allocation with Learning of Impact
Blossom Metevier,
Dennis Wei,
Karthi Ramamurthy,
Philip S. Thomas
Advances in Neural Information Processing Systems (NeurIPS 2025)
Abstract
Recent works have studied fair resource allocation in social settings, where fairness is judged by the impact
of allocation decisions rather than more traditional minimum or maximum thresholds on the allocations themselves. Our
work significantly adds to this literature by developing continuous resource allocation strategies that adhere to
equality of impact, a generalization of equality of opportunity. We derive methods to maximize total welfare
across groups subject to minimal violation of equality of impact, in settings where the outcomes of allocations are
unknown but have a diminishing marginal effect. While focused on a two-group setting, our study addresses a broader
class of welfare dynamics than explored in prior work. Our contributions are threefold. First, we provide an algorithm
designed for non-noisy continuous-resource environments that achieves sublinear fairness regret. Second, we propose a
meta-algorithm for noisy settings that builds on the first contribution. Third, we empirically demonstrate that our
approach consistently achieves fair, welfare-maximizing allocations.
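A toy illustration of the equality-of-impact idea (the impact functions, grid search, and tolerance are my own stand-ins, not the paper's algorithms): split a fixed budget between two groups with diminishing returns so that total impact is maximized while the groups' impacts stay within a small tolerance of each other.

```python
# Toy equality-of-impact allocation for two groups with diminishing returns.
# The concave impact functions and the brute-force grid search are
# placeholders; the paper handles unknown, noisy outcomes with regret bounds.
import numpy as np

def impact_a(x):                 # hypothetical concave impact for group A
    return np.sqrt(x)

def impact_b(x):                 # hypothetical concave impact for group B
    return 2.0 * np.log1p(x)

def allocate(budget, epsilon=0.05, grid=10_000):
    """Return the split of `budget` maximizing total impact subject to the
    two impacts differing by at most `epsilon` (None if infeasible)."""
    best, best_welfare = None, -np.inf
    for x in np.linspace(0.0, budget, grid):
        ia, ib = impact_a(x), impact_b(budget - x)
        if abs(ia - ib) <= epsilon and ia + ib > best_welfare:
            best, best_welfare = (x, budget - x), ia + ib
    return best, best_welfare

if __name__ == "__main__":
    print(allocate(budget=10.0))
```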
Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees
Blossom Metevier*,
Yaswanth Chittepu*,
Will Swarzer,
Scott Niekum,
Philip S. Thomas
*Equal contribution
Reinforcement Learning Conference (RLC 2025)
Abstract | arXiv
Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness,
which can lead to unacceptable responses in sensitive domains. To ensure reliable performance in such settings,
we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence
safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences
into helpfulness and harmlessness (safety), which are learned by training a reward model and a cost model, respectively. It
then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function under an
intentionally pessimistic version of the cost constraint. In the second step, the trained model undergoes a safety test to
verify whether its performance stays within an upper-confidence bound of the actual cost constraint. We provide a theoretical
analysis of HC-RLHF, including proof that it will not return an unsafe solution with a probability greater than a
user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B,
Qwen2.5-3B, and LLaMa3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high
probability and can improve harmlessness and helpfulness compared to previous methods.
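The two-step structure described in the abstract can be sketched as follows (toy stand-ins for the reward and cost models and for the confidence bound; the actual method optimizes a language model against reward and cost models trained on preference data):

```python
# Schematic of the two-step HC-RLHF structure from the abstract, with toy
# candidates and a Hoeffding-style bound standing in for the paper's models
# and analysis.
import numpy as np

def cost_upper_bound(costs, delta):
    """One-sided confidence bound on the expected cost (illustrative)."""
    n = len(costs)
    return np.mean(costs) + np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def hc_rlhf_sketch(candidates, reward_fn, cost_fn, cost_limit, delta, pessimism=0.1):
    # Step 1: optimize the reward under an intentionally pessimistic
    # (tightened) version of the cost constraint.
    feasible = [c for c in candidates
                if np.mean(cost_fn(c)) <= cost_limit - pessimism]
    if not feasible:
        return None                                   # no safe solution found
    candidate = max(feasible, key=lambda c: np.mean(reward_fn(c)))

    # Step 2: safety test; return the candidate only if an upper-confidence
    # bound on its cost respects the actual constraint.
    if cost_upper_bound(cost_fn(candidate), delta) <= cost_limit:
        return candidate
    return None

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    thetas = [0.2, 0.5, 0.8]                              # toy "policies"
    reward_fn = lambda t: rng.normal(t, 0.1, size=200)    # more helpful as t grows
    cost_fn = lambda t: rng.normal(t / 2, 0.1, size=200)  # but also more harmful
    print(hc_rlhf_sketch(thetas, reward_fn, cost_fn, cost_limit=0.3, delta=0.05))
```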
Analyzing the Relationship Between Difference and Ratio-Based Fairness Metrics
Min-Hsuan Yeh,
Blossom Metevier,
Austin Hoag,
Philip S. Thomas
ACM Conference on Fairness, Accountability, and Transparency (FAccT 2024)
Abstract | Paper
In research studying the fairness of machine learning algorithms and models, fairness often means that a metric is the same when computed for two different groups of people. For example, one might define fairness to mean that the false positive rate of a classifier is the same for people of different genders, ages, or races. However, it is usually not possible to make this metric identical for all groups. Instead, algorithms ensure that the metric is similar; for example, that the false positive rates are similar. Researchers usually measure this similarity or dissimilarity using either the difference or ratio between the metric values for different groups of people. Although these two approaches are known to be different, there has been little work analyzing their differences and respective benefits. In this paper we examine this relationship analytically and empirically, and conclude that unless there are application-specific reasons to prefer the difference approach, the ratio approach should be preferred.
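Concretely, the two approaches compared in the paper amount to checking either the difference or the ratio of a group metric against a tolerance; a minimal example with made-up data and thresholds:

```python
# Minimal example of difference-based versus ratio-based comparison of group
# false positive rates. Data and tolerances are made up for illustration.
import numpy as np

def false_positive_rate(y_true, y_pred):
    negatives = (y_true == 0)
    return np.mean(y_pred[negatives] == 1)

def difference_fair(m_a, m_b, epsilon=0.05):
    return abs(m_a - m_b) <= epsilon

def ratio_fair(m_a, m_b, tau=0.8):
    if max(m_a, m_b) == 0:
        return True                      # both rates are zero
    return min(m_a, m_b) / max(m_a, m_b) >= tau

if __name__ == "__main__":
    y_true = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0])
    y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])
    group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    fpr_a = false_positive_rate(y_true[group == 0], y_pred[group == 0])
    fpr_b = false_positive_rate(y_true[group == 1], y_pred[group == 1])
    print(difference_fair(fpr_a, fpr_b), ratio_fair(fpr_a, fpr_b))
```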
Fairness Guarantees under Demographic Shift
Stephen Giguere,
Blossom Metevier,
Yuriy Brun,
Philip S. Thomas
International Conference on Learning Representations (ICLR 2022)
Abstract | Paper
Recent studies found that using machine learning for social applications can lead to injustice in the form of racist, sexist, and otherwise unfair and discriminatory outcomes. To address this challenge, recent machine learning algorithms have been designed to limit the likelihood such unfair behavior occurs. However, these approaches typically assume the data used for training is representative of what will be encountered in deployment, which is often untrue. In particular, if certain subgroups of the population become more or less probable in deployment (a phenomenon we call demographic shift), prior work’s fairness assurances are often invalid. In this paper, we consider the impact of demographic shift and present a class of algorithms, called Shifty algorithms, that provide high-confidence behavioral guarantees that hold under demographic shift when data from the deployment environment is unavailable during training. Shifty, the first technique of its kind, demonstrates an effective strategy for designing algorithms to overcome demographic shift’s challenges. We evaluate Shifty using the UCI Adult Census dataset (Kohavi and Becker, 1996), as well as a real-world dataset of university entrance exams and subsequent student success. We show that the learned models avoid bias under demographic shift, unlike existing methods. Our experiments demonstrate that our algorithm’s high-confidence fairness guarantees are valid in practice and that our algorithm is an effective tool for training models that are fair when demographic shift occurs.
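One way to picture the demographic-shift guarantee (a simplification of my own, not the Shifty algorithms): if the quantity being certified is a weighted average of per-subgroup terms, and deployment subgroup proportions are only known to lie in intervals, then the certificate must hold for the worst admissible mixture.

```python
# Sketch (my simplification, not Shifty): worst-case value of a weighted
# average of per-subgroup unfairness terms when subgroup proportions are
# only known to lie in intervals [lo_i, hi_i] and must sum to one.
import numpy as np

def worst_case_mean(values, lo, hi):
    """Maximize sum_i w_i * values_i over lo_i <= w_i <= hi_i, sum w_i = 1,
    by greedily shifting the leftover mass onto the largest terms."""
    values, lo, hi = map(lambda v: np.asarray(v, dtype=float), (values, lo, hi))
    w = lo.copy()
    remaining = 1.0 - w.sum()
    for i in np.argsort(values)[::-1]:          # largest terms first
        add = min(hi[i] - w[i], remaining)
        w[i] += add
        remaining -= add
    return float(w @ values)

if __name__ == "__main__":
    unfairness = [0.02, 0.08, 0.05]             # per-subgroup estimates (toy)
    lo, hi = [0.2, 0.1, 0.2], [0.6, 0.5, 0.5]   # plausible proportion ranges
    print(worst_case_mean(unfairness, lo, hi))  # certify only if below a threshold
```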
Reinforcement Learning When All Actions are Not Always Available
Yash Chandak,
Georgios Theocharous,
Blossom Metevier,
Philip S. Thomas
AAAI Conference on Artificial Intelligence (AAAI 2020)
Abstract | arXiv
The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs suffer from divergence issues, and present new algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches using several tasks inspired by real-life use cases wherein the action set is stochastic.
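For a feel of the setting (a vanilla tabular baseline, not the paper's variance-reduced algorithms): Q-learning in which both action selection and the bootstrapped maximum are restricted to whichever actions happen to be available at each step.

```python
# Toy Q-learning on a chain MDP with stochastic action sets: the agent may
# only choose among, and bootstrap over, the actions offered at each step.
# This is the naive baseline; the paper's algorithms add variance reduction.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def step(s, a):
    """Toy dynamics: high actions move right, low actions move left;
    reaching the last state pays reward 1."""
    s_next = min(n_states - 1, max(0, s + (1 if a >= 2 else -1)))
    return s_next, float(s_next == n_states - 1)

s = 0
available = np.flatnonzero(rng.random(n_actions) < 0.7)     # actions offered now
for _ in range(5000):
    if available.size == 0:                                 # nothing offered: wait
        available = np.flatnonzero(rng.random(n_actions) < 0.7)
        continue
    if rng.random() < epsilon:
        a = rng.choice(available)
    else:
        a = available[np.argmax(Q[s, available])]
    s_next, r = step(s, a)
    next_available = np.flatnonzero(rng.random(n_actions) < 0.7)
    bootstrap = Q[s_next, next_available].max() if next_available.size else 0.0
    Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
    s, available = (0, next_available) if r > 0 else (s_next, next_available)

print(Q.round(2))
```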
Offline Contextual Bandits with High Probability Fairness Guarantees
Blossom Metevier,
Stephen Giguere,
Sarah Brockman,
Ari Kobren,
Yuriy Brun,
Emma Brunskill,
Philip S. Thomas
Advances in Neural Information Processing Systems (NeurIPS 2019)
Abstract | Paper
We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Unlike previous work, our algorithm accepts multiple fairness definitions and allows users to construct their own unique fairness definitions for the problem at hand. We provide a theoretical analysis of RobinHood, which includes a proof that it will not return an unfair solution with probability greater than a user-specified threshold. We validate our algorithm on three applications: a tutoring system in which we conduct a user study and consider multiple unique fairness definitions; a loan approval setting (using the Statlog German credit data set) in which well-known fairness definitions are applied; and criminal recidivism (using data released by ProPublica). In each setting, our algorithm is able to produce fair policies that achieve performance competitive with other offline and online contextual bandit algorithms.
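The high-probability guarantee here follows a candidate-selection / safety-test pattern; a bare-bones sketch of that pattern (with toy stand-ins for the policies, performance estimates, and fairness statistic, not RobinHood's off-policy machinery):

```python
# Bare-bones candidate-selection / safety-test sketch of the kind of
# guarantee RobinHood provides. Policies, performance, and the unfairness
# statistic are toy stand-ins; the real algorithm works from logged
# contextual-bandit data with off-policy estimators.
import numpy as np
from scipy import stats

def upper_bound(samples, delta):
    """One-sided Student-t upper confidence bound on the mean."""
    n = len(samples)
    return np.mean(samples) + stats.t.ppf(1 - delta, n - 1) * np.std(samples, ddof=1) / np.sqrt(n)

def train(candidates, perf_fn, unfairness_fn, delta=0.05):
    # Candidate selection: best estimated performance among candidates whose
    # estimated unfairness (on candidate-selection data) is non-positive.
    plausible = [c for c in candidates if np.mean(unfairness_fn(c)) <= 0.0]
    if not plausible:
        return "NO_SOLUTION_FOUND"
    best = max(plausible, key=perf_fn)
    # Safety test: return the candidate only if, with confidence 1 - delta,
    # its expected unfairness is at most zero on held-out data.
    if upper_bound(unfairness_fn(best), delta) <= 0.0:
        return best
    return "NO_SOLUTION_FOUND"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = [0.1, 0.4, 0.9]                           # toy policy parameters
    perf = lambda c: c                                     # higher is better
    unfair = lambda c: rng.normal(c - 0.5, 0.2, size=300)  # fair iff mean <= 0
    print(train(candidates, perf, unfair))
```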