Blossom Metevier

I am a postdoctoral fellow at Princeton University's Center for Information Technology Policy, under the guidance of Aleksandra Korolova. My research in machine learning (ML) focuses on sequential learning, with specific interests in:

  • Interactive Learning, including bandits and reinforcement learning, with applications to large language models
  • Responsible ML, with a focus on provable fairness & safety guarantees
  • Reliable ML, with a focus on robustness to challenges in data integrity

I completed my Ph.D. at the University of Massachusetts, where I was advised by Phil Thomas in the Autonomous Learning Lab. Previously, I was a research intern at IBM, Microsoft, and Meta, where I focused on issues related to algorithmic fairness in ML. I earned my bachelor's degree in Computer Science and Mathematics from the University of Maryland Baltimore County, where I also competed as a track & field athlete.

Google Scholar | bmetevier [at] princeton [dot] edu


Updates

⚙️ ICLR 2026 Workshop Acceptances. Our papers, "Statistical Verification of Fairness in Agentic Alignment" and "Towards Statistical Verification for Trustworthy AI," have been accepted to the Principled Design for Trustworthy AI and Algorithmic Fairness Across Alignment Procedures and Agentic Systems workshops, respectively.

⚙️ EvalEval Workshop Poster. Our poster on "Measuring Validity in LLM-based Resume Screening" has been accepted to the EvalEval workshop.

📝 IASEAI 2026 Acceptance. Our paper on "Measuring Validity in LLM-based Resume Screening" has been accepted at IASEAI.

📝 NeurIPS 2025 Acceptances. Our papers on "Fair Continuous Resource Allocation with Learning of Impact" and "Managing the Repercussions of Machine Learning Applications" have been accepted at NeurIPS.

🏢 Fall 2025 Postdoctoral Position. I started as a Postdoctoral Research Fellow at the Center for Information Technology Policy, working with Aleksandra Korolova.

🎓 Spring 2025 Ph.D. Defense. I defended my Ph.D. dissertation in Computer Science, titled "Fair Algorithms for Sequential Learning Problems."

📝 RLC 2025 Acceptance. Our paper on "Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees" has been accepted at RLC.

🎓 Spring 2024 Thesis Proposal. I proposed my thesis, titled "Fair Algorithms for Sequential Learning Problems."

📝 FAccT 2024 Acceptance. Our paper on "Analyzing the Relationship Between Difference and Ratio-Based Fairness Metrics" has been accepted at FAccT.

🏢 Fall 2022 Internship. I worked on the Responsible AI Team at Facebook AI Research.

🏢 Summer 2022 Internship. I worked with Nicolas Le Roux at MSR FATE Montréal.

📝 ICLR 2022 Acceptance. Our paper on "Fairness Guarantees under Demographic Shift" has been accepted at ICLR.

🏢 Summer 2021 Internship. I worked with Dennis Wei and Karthi Ramamurthy in the Trustworthy AI group at IBM.

📅 NERDS 2020 Organizer. Emma Jordan and I organized the first Northeast Reinforcement Learning and Decision Making Symposium (NERDS).

Selected Publications


Measuring Validity in LLM-based Resume Screening
Jane Castleman, Zeyu Shen, Blossom Metevier, Max Springer, Aleksandra Korolova
International Association for Safe & Ethical AI (IASEAI 2026)

Abstract | Paper

Resume screening is perceived as a particularly suitable task for LLMs, given their ability to analyze natural language; thus, many individuals rely on general-purpose LLMs without further adapting them to the task. While researchers have shown that some LLMs are biased in their selection rates across demographic groups, studies measuring the validity of LLM decisions are limited. One of the difficulties in externally measuring validity is the lack of access to a large corpus of resumes for which the ground-truth ranking is known and that has not already been used for training. In this work, we overcome this challenge by systematically constructing a large dataset of resumes tailored to particular jobs that are directly comparable, with a known ground truth of superiority. We then use the constructed dataset to measure the validity of ranking decisions made by various LLMs at scale, finding that many models are unable to consistently select the resumes describing more qualified candidates. Furthermore, when measuring the fairness of decisions, we find that models do not reliably abstain when ranking equally qualified candidates, and select candidates from different demographic groups at different rates, generally prioritizing historically marginalized candidates. In conclusion, our framework provides a principled audit of LLM resume screeners in the absence of ground truth, offering a crucial tool to independent auditors and developers to ensure the validity of these systems as they are deployed at scale.


Beyond Prediction: Managing the Repercussions of Machine Learning Applications
Aline Weber*, Blossom Metevier*, Yuriy Brun, Philip S. Thomas, Bruno Castro da Silva
*Equal contribution
Advances in Neural Information Processing Systems (NeurIPS 2025)

Abstract | Paper

Machine learning models are often designed to maximize a primary goal, such as accuracy. However, as these models are increasingly used to inform decisions that affect people’s lives or well-being, it is often unclear what the real-world repercussions of their deployment might be—making it crucial to understand and manage such repercussions effectively. Models maximizing user engagement on social media platforms, e.g., may inadvertently contribute to the spread of misinformation and content that deepens political polarization. This issue is not limited to social media—it extends to other applications where machine learning-informed decisions can have real-world repercussions, such as education, employment, and lending. Existing methods addressing this issue require prior knowledge or estimates of analytical models describing the relationship between a classifier’s predictions and their corresponding repercussions. We introduce THEIA, a novel classification algorithm capable of optimizing a primary objective, such as accuracy, while providing high-confidence guarantees about its potential repercussions. Importantly, THEIA solves the open problem of providing such guarantees based solely on existing data with observations of previous repercussions. We prove that it satisfies constraints on a model’s repercussions with high confidence and that it is guaranteed to identify a solution, if one exists, given sufficient data. We empirically demonstrate, using real-life data, that THEIA can identify models that achieve high accuracy while ensuring, with high confidence, that constraints on their repercussions are satisfied.


Fair Continuous Resource Allocation with Learning of Impact
Blossom Metevier, Dennis Wei, Karthi Ramamurthy, Philip S. Thomas
Advances in Neural Information Processing Systems (NeurIPS 2025)

Abstract | Paper

Recent works have studied fair resource allocation in social settings, where fairness is judged by the impact of allocation decisions rather than more traditional minimum or maximum thresholds on the allocations themselves. Our work significantly adds to this literature by developing continuous resource allocation strategies that adhere to equality of impact, a generalization of equality of opportunity. We derive methods to maximize total welfare across groups subject to minimal violation of equality of impact, in settings where the outcomes of allocations are unknown but have a diminishing marginal effect. While focused on a two-group setting, our study addresses a broader class of welfare dynamics than explored in prior work. Our contributions are three-fold. First, we provide an algorithm designed for non-noisy continuous-resource environments that achieves sublinear fairness regret. Second, we propose a meta-algorithm for noisy settings building on the first contribution, and third, we empirically demonstrate that our approach consistently achieves fair, welfare-maximizing allocations.


Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees
Blossom Metevier*, Yaswanth Chittepu*, Will Schwarzer, Scott Niekum, Philip S. Thomas
*Equal contribution
Reinforcement Learning Conference (RLC 2025)

Abstract | Paper

Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable actions in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Like previous methods, HC-RLHF explicitly decouples human preferences regarding helpfulness and harmlessness (safety), training a separate reward model and cost model for each. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function while ensuring that a specific upper-confidence bound on the cost constraint is satisfied. In the second step, the trained model undergoes a safety test to verify whether its performance satisfies a separate upper-confidence bound on the cost constraint. We provide a theoretical analysis of HC-RLHF, including a proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa-3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability while also improving helpfulness and harmlessness compared to previous methods.


Analyzing the Relationship Between Difference and Ratio-Based Fairness Metrics
Min-Hsuan Yeh, Blossom Metevier, Austin Hoag, Philip S. Thomas
ACM Conference on Fairness, Accountability, and Transparency (FAccT 2024)

Abstract | Paper

In research studying the fairness of machine learning algorithms and models, fairness often means that a metric is the same when computed for two different groups of people. For example, one might define fairness to mean that the false positive rate of a classifier is the same for people of different genders, ages, or races. However, it is usually not possible to make this metric identical for all groups. Instead, algorithms ensure that the metric is similar—for example, that the false positive rates are similar. Researchers usually measure this similarity or dissimilarity using either the difference or ratio between the metric values for different groups of people. Although these two approaches are known to be different, there has been little work analyzing their differences and respective benefits. In this paper we examine this relationship analytically and empirically, and conclude that unless there are application-specific reasons to prefer the difference approach, the ratio approach should be preferred.


Fairness Guarantees under Demographic Shift
Stephen Giguere, Blossom Metevier, Yuriy Brun, Philip S. Thomas
International Conference on Learning Representations (ICLR 2022)

Abstract | Paper

Recent studies found that using machine learning for social applications can lead to injustice in the form of racist, sexist, and otherwise unfair and discriminatory outcomes. To address this challenge, recent machine learning algorithms have been designed to limit the likelihood that such unfair behavior occurs. However, these approaches typically assume the data used for training is representative of what will be encountered in deployment, which is often untrue. In particular, if certain subgroups of the population become more or less probable in deployment (a phenomenon we call demographic shift), prior work’s fairness assurances are often invalid. In this paper, we consider the impact of demographic shift and present a class of algorithms, called Shifty algorithms, that provide high-confidence behavioral guarantees that hold under demographic shift when data from the deployment environment is unavailable during training. Shifty, the first technique of its kind, demonstrates an effective strategy for designing algorithms to overcome demographic shift’s challenges. We evaluate Shifty using the UCI Adult Census dataset (Kohavi and Becker, 1996), as well as a real-world dataset of university entrance exams and subsequent student success. We show that the learned models avoid bias under demographic shift, unlike existing methods. Our experiments demonstrate that our algorithm’s high-confidence fairness guarantees are valid in practice and that our algorithm is an effective tool for training models that are fair when demographic shift occurs.


Reinforcement Learning When All Actions are Not Always Available
Yash Chandak, Georgios Theocharous, Blossom Metevier, Philip S. Thomas
AAAI Conference on Artificial Intelligence (AAAI 2020)

Abstract | Paper

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs suffer from divergence issues, and present new algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches using several tasks inspired by real-life use cases wherein the action set is stochastic.


Offline Contextual Bandits with High Probability Fairness Guarantees
Blossom Metevier, Stephen Giguere, Sarah Brockman, Ari Kobren, Yuriy Brun, Emma Brunskill, Philip S. Thomas
Advances in Neural Information Processing Systems (NeurIPS 2019)

Abstract | Paper

We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Unlike previous work, our algorithm accepts multiple fairness definitions and allows users to construct their own unique fairness definitions for the problem at hand. We provide a theoretical analysis of RobinHood, which includes a proof that it will not return an unfair solution with probability greater than a user-specified threshold. We validate our algorithm on three applications: a tutoring system in which we conduct a user study and consider multiple unique fairness definitions; a loan approval setting (using the Statlog German credit data set) in which well-known fairness definitions are applied; and criminal recidivism (using data released by ProPublica). In each setting, our algorithm is able to produce fair policies that achieve performance competitive with other offline and online contextual bandit algorithms.

Preprints & Current Projects


Fair Offline Contextual Bandits with Guarantees under Inferred Attributes
Blossom Metevier*, Yaswanth Chittepu*, Max Springer, Sohini Chintala, Bohdan Turbal, Scott Niekum, Aleksandra Korolova

Abstract

Although offline contextual bandits achieve strong performance in low-stakes settings, their role in automated decision-making and decision support raises serious concerns about the fairness of their behavior. Many existing strategies for mitigating unfairness rely on regularization or empirical validation, which can fail to prevent spurious constraint satisfaction under finite data. One emerging strategy instead imposes probabilistic group-fairness constraints that explicitly account for statistical uncertainty in the data-generating process and its interaction with the learning pipeline. Existing high-confidence methods, however, assume that sensitive attributes (e.g., gender or race) are observed as ground truth. In practice, these attributes are often unavailable and must be inferred. We show that naively treating inferred attributes as ground truth can invalidate high-probability fairness guarantees. We then develop methods for fair offline contextual bandits that restore valid guarantees under inferred attributes by incorporating assumptions on the reliability of the sensitive-attribute classifier. We focus on two common group fairness notions: demographic parity and reward disparity. Empirical evaluations on benchmark datasets demonstrate that our approach achieves competitive reward while satisfying the fairness guarantees.

In Submission


The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

Abstract | Paper

Fine-tuning aligned language models on entirely benign tasks (e.g., math tutoring) unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the very dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions, an effect invisible to all existing defenses. We formalize this mechanism through the Alignment Instability Condition (AIC), three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning (null-space projections, gradient filtering, and first-order constraints) address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.

In Submission


Matched Pair Calibration for Ranking Fairness
Hannah Korevaar, Chris McConnell, Edmund Tong, Erik Brinkman, Alana Shine, Misam Abbas, Blossom Metevier, Sam Corbett-Davies, Khalid El-Arini

Abstract | arXiv

We propose a test of fairness in score-based ranking systems called matched pair calibration. Our approach constructs a set of matched item pairs with minimal confounding differences between subgroups before computing an appropriate measure of ranking error over the set. The matching step ensures that we compare subgroup outcomes between identically scored items so that measured performance differences directly imply unfairness in subgroup-level exposures. We show how our approach generalizes the fairness intuitions of calibration from a binary classification setting to ranking and connect our approach to other proposals for ranking fairness measures. Moreover, our strategy shows how the logic of marginal outcome tests extends to cases where the analyst has access to model scores. Lastly, we provide an example of applying matched pair calibration to a real-world ranking data set to demonstrate its efficacy in detecting ranking bias.

Personal

Apart from the academic grind, I enjoy running, weightlifting, and reading. I’m a fan of the DC Universe, especially the Teen Titans, and I follow a number of Japanese comics. I also love spending time with my cats (featured here) and my dog!