Weekly Curated Selection

Editor's Choice

The most significant AI safety papers, curated weekly.

2 Weeks · 20 Top Picks

The turn of the year brings a rigorous focus to the structural foundations of model reliability. This week's selection emphasizes theoretical advances in abstention, trust regions, and reward modeling that enable AI systems to operate safely under uncertainty. From multi-expert deferral to interpretable alignment via sparse autoencoders, these papers address the critical question of how to build systems that know their limits.

1. Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral

Anqi Mao

Why it matters: This work provides the theoretical foundation and practical algorithms for AI systems to safely 'know what they don't know' by rigorously deferring difficult tasks to experts or abstaining from high-risk predictions.

Establishes $H$-consistent surrogate losses for multi-class abstention and multi-expert deferral. These formal guarantees enable reliable routing of uncertain inputs to specialized experts, mitigating hallucinations and ensuring robust performance under uncertainty.
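For intuition, here is a minimal cost-based routing rule in the same spirit (a toy sketch; the thresholds and helper names are hypothetical, and the paper learns this decision through surrogate losses rather than a fixed rule):

```python
import numpy as np

def predict_abstain_or_defer(probs, expert_probs, abstain_cost=0.3, defer_cost=0.2):
    """Toy cost-based router: predict, abstain, or defer to the best expert,
    whichever has the lowest expected cost. Illustration only; the paper
    learns this decision via H-consistent surrogate losses, not a fixed
    post-hoc rule like this one."""
    model_risk = 1.0 - probs.max()                # expected error if the model predicts
    expert_risk = 1.0 - expert_probs.max(axis=1)  # expected error of each expert
    best = int(expert_risk.argmin())
    options = {
        ("predict", int(probs.argmax())): model_risk,
        ("abstain", None): abstain_cost,
        ("defer", best): defer_cost + expert_risk[best],
    }
    return min(options, key=options.get)

# An uncertain model prediction gets routed to the stronger expert.
probs = np.array([0.40, 0.35, 0.25])
expert_probs = np.array([[0.95, 0.03, 0.02], [0.50, 0.30, 0.20]])
print(predict_abstain_or_defer(probs, expert_probs))  # ('defer', 0)
```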

Score: 8.8/10
Significance: 9
Novelty: 9
Quality: 10
Tags: Alignment Theory

2. Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Yingru Li, Jiacai Liu, Jiawei Xu +3 more

Why it matters: TRM provides the first theoretically grounded method to ensure stable reinforcement learning for long-horizon LLM tasks by replacing ineffective token-level clipping with sequence-level trust region guarantees.

Derives $O(T)$ trust region bounds for LLM-RL by identifying the maximum token-level KL divergence, $D_{\mathrm{KL}}^{\mathrm{tok,max}}$, as the critical error driver, and proposes Trust Region Masking, which enforces a sequence-level trust region in place of ineffective token-level clipping to stabilize long-horizon training.
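A minimal sketch of the sequence-level idea, based only on the summary above (the KL estimator, budget, and masking criterion are assumptions, not the authors' implementation):

```python
import numpy as np

def trust_region_mask(logp_new, logp_old, kl_budget=0.5):
    """Estimate each sequence's divergence from the behaviour policy by summing
    the sampled-token log-ratios, then exclude whole sequences that exceed a
    sequence-level budget instead of clipping individual tokens."""
    seq_kl = (logp_old - logp_new).sum(axis=-1)       # sample-based KL(old || new) per sequence
    mask = (seq_kl <= kl_budget).astype(logp_new.dtype)
    return mask                                       # multiply into the policy-gradient loss

# The second sequence has drifted too far and is masked out of the update.
logp_old = np.log(np.array([[0.50, 0.60, 0.70], [0.50, 0.60, 0.70]]))
logp_new = np.log(np.array([[0.45, 0.55, 0.65], [0.05, 0.10, 0.10]]))
print(trust_region_mask(logp_new, logp_old))  # [1. 0.]
```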

Score: 8.8/10
Significance: 9
Novelty: 9
Quality: 9
Tags: Alignment Theory

3. SWE-RM: Execution-free Feedback For Software Engineering Agents

KaShun Shum, Binyuan Hui, Jiawei Chen +6 more

Why it matters: SWE-RM provides a robust, execution-free reward model for software agents, solving the critical gap between test-time selection and reinforcement learning performance.

SWE-RM introduces a 30B MoE reward model for execution-free feedback, bypassing the sparsity and safety risks of unit-test-based RL. By optimizing for calibration and classification, it closes the gap between test-time selection and reinforcement learning performance.

Score: 8.8/10
Significance: 9
Novelty: 8
Quality: 9
Tags: RLHF · Agent Safety

4. Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Yuwen Li, Wei Zhang, Zelong Huang +8 more

Why it matters: InfTool establishes a fully autonomous, self-evolving pipeline for tool-use mastery, proving that mid-sized models can rival frontier models through multi-agent synthetic data and reinforcement learning.

InfTool scales tool-use reliability via multi-agent synthesis and GRPO, generating verified trajectories from raw API specs. This closed-loop framework enables autonomous agents to master complex workflows without human labels, allowing mid-sized models to rival frontier systems.
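For readers unfamiliar with GRPO, the group-relative advantage it builds on is the standard recipe below (generic, not InfTool-specific):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: score each rollout of the same prompt against the
    group mean and normalize by the group std, avoiding a learned critic."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Four rollouts for one prompt, scored 0/1 by a verifier.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1. -1. -1.  1.]
```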

Score: 8.8/10
Significance: 9
Novelty: 8
Quality: 9
Tags: Agent Safety · RLHF

5. Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

Dianyun Wang, Qingsen Ma, Yuhu Shang +5 more

Why it matters: This paper bridges mechanistic interpretability and parameter-efficient fine-tuning by using Sparse Autoencoders to construct an explicit, interpretable subspace for safety alignment that rivals RLHF in performance.

Leverages Sparse Autoencoders (SAEs) to construct interpretable low-rank subspaces for safety alignment, mitigating polysemanticity in weight updates. Achieves 99.6% safety with <0.25% parameters, grounding alignment in disentangled features for improved transparency and control.
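A shape-level sketch of one way such an update could be parameterized (my assumption of the general recipe; the variable names and construction are illustrative, not taken from the paper):

```python
import numpy as np

def sae_subspace_delta(decoder_dirs, safety_feature_ids, coeffs):
    """Use the SAE decoder directions of a few interpretable safety-relevant
    features as the basis of a low-rank weight update, learning only the
    mixing coefficients. decoder_dirs: [n_features, d_model] SAE decoder;
    coeffs: [k, d_model] learned weights for the k chosen features."""
    basis = decoder_dirs[safety_feature_ids]   # [k, d_model] interpretable directions
    delta_W = coeffs.T @ basis                 # rank-k update; its row space stays in span(basis)
    return delta_W                             # added to the frozen base weight matrix

# Shapes only: 8 SAE features of width 16, adapting with 3 of them.
decoder = np.random.randn(8, 16)
delta = sae_subspace_delta(decoder, [1, 4, 6], np.random.randn(3, 16))
print(delta.shape)  # (16, 16)
```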

Score: 8.8/10
Significance: 9
Novelty: 9
Quality: 8
Tags: Alignment Theory

6. Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Tsogt-Ochir Enkhbayar

Why it matters: This paper reveals that LLMs fail to distinguish between warnings and instructions due to shared latent features, proving that 'negative' training data inadvertently reinforces the very behaviors it seeks to prevent.

SAE analysis reveals that warning-framed data fails because "describing X" and "performing X" share non-orthogonal latent features. This "stealth slip" bypasses linear probes, so warning-framed data can quietly reinforce the very behavior it warns against.

Score: 8.8/10
Significance: 9
Novelty: 9
Quality: 8
Tags: Mech. Interp. · Alignment Theory · Robustness

7. Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning

Deniz Akdemir

Why it matters: This paper replaces the flawed paradigm of feature invariance with a rigorous decision-theoretic framework based on Le Cam's theory, preventing catastrophic information loss during domain adaptation in safety-critical systems.

Replaces symmetric feature invariance with directional simulability via Le Cam's Deficiency Distance $\delta(E_1, E_2)$. This prevents information destruction and catastrophic negative transfer in RL and other safety-critical domain-adaptation settings.
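To fix notation, Le Cam's deficiency (the standard definition, stated here for context and up to normalization conventions) compares experiments $E_1 = (P_\theta)_{\theta \in \Theta}$ and $E_2 = (Q_\theta)_{\theta \in \Theta}$ by how well $E_2$ can be simulated from $E_1$ through a randomization $K$:

$$\delta(E_1, E_2) \;=\; \inf_{K} \, \sup_{\theta \in \Theta} \, \big\| K P_\theta - Q_\theta \big\|_{\mathrm{TV}}.$$

Because $\delta(E_1, E_2) \neq \delta(E_2, E_1)$ in general, the quantity is directional, which is exactly what lets it replace symmetric feature invariance.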

Score: 8.5/10
Significance: 9
Novelty: 9
Quality: 9
Tags: Robustness · Agent Safety

8. Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

Jiachen T. Wang, Tong Wu, Kaifeng Lyu +4 more

Why it matters: This paper exposes a critical flaw in how we use small proxy models to select data recipes and provides a simple, theoretically backed fix to ensure small-scale experiments actually predict large-scale success.

Fixed-hyperparameter proxy models fail to predict large-scale data quality due to data-dependent optima. A reduced learning rate protocol preserves the ordering of datasets by their optimal achievable loss, enabling reliable, cost-effective curation of safety-critical data.

Score: 8.8/10
Significance: 9
Novelty: 8
Quality: 9
Tags: Evaluations

9. DECEPTICON: How Dark Patterns Manipulate Web Agents

Phil Cuvin, Hao Zhu, Diyi Yang

Why it matters: This paper reveals that web agents are twice as susceptible to manipulative 'dark patterns' as humans, with larger and more capable models paradoxically being the easiest to deceive.

DECEPTICON benchmarks agent susceptibility to dark patterns, finding SOTA models are manipulated into malicious outcomes in >70% of tasks. Susceptibility scales with model size and reasoning, highlighting a critical, unmitigated vulnerability in agentic instruction-following.

Score: 8.8/10
Significance: 9
Novelty: 8
Quality: 8
Tags: Agent Safety · Evaluations · Robustness

10. Seeking Late Night Life Lines: Experiences of Conversational AI Use in Mental Health Crisis

Leah Hope Ajmani, Arka Ghosh, Benjamin Kaveladze +7 more

Why it matters: This paper provides critical empirical evidence on how people use LLMs during mental health crises, arguing that AI safety in this domain requires designing agents as bridges to human connection rather than standalone solutions.

Leverages the stages-of-change model to frame AI crisis interventions as bridges to human connection. It defines safety as de-escalating negative actions while increasing user preparedness for human-led care, rather than treating the AI as a standalone solution.

Score: 8.8/10
Significance: 9
Novelty: 8
Quality: 8
Tags: Alignment Theory

The final week of December brought a surge of 556 papers, which we have distilled into ten top picks highlighting the year's closing themes. From theoretical foundations of RL-tuned language models to cross-cultural studies of AI anthropomorphism, this week showcases the breadth of AI safety research. Notable contributions include rigorous critiques of interpretability methods, empirical analysis of AI-generated code security risks, and frameworks for performative reinforcement learning.

1. A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models

Zhiquan Tan, Yinrong Hong

Why it matters: This paper provides a rigorous theoretical foundation for RL-tuned language models by framing them as Energy-Based Models, offering formal proofs for their convergence, reasoning capabilities, and the fundamental trade-offs between entropy and accuracy.

Leverages EBM structure to prove that instruction-tuned LLMs satisfy detailed balance, ensuring monotonic KL convergence to high-quality states, and formalizes the entropy-accuracy trade-offs in reasoning models.
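For context, the standard link between KL-regularized RL fine-tuning and energy-based models (a well-known result stated here to make the EBM lens concrete; the paper's exact construction may differ) is that the optimum of $\mathbb{E}_{y \sim \pi}[r(x,y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ is the Boltzmann distribution

$$\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x,y)/\beta\big),$$

i.e. an EBM with energy $-\log \pi_{\mathrm{ref}}(y \mid x) - r(x,y)/\beta$.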

Score: 8.6/10
Significance: 9
Novelty: 9
Quality: 9
Tags: RLHF · Alignment Theory

2. Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally

Robin Schimmelpfennig, Mark Díaz, Vinodkumar Prabhakaran +1 more

Why it matters: This study provides the first large-scale, cross-national experimental evidence that humanlike AI design impacts trust and engagement differently across cultures, challenging universal assumptions in AI safety governance.

Causal evidence shows humanlike design increases anthropomorphism via interactional cues, but its impact on trust is culturally divergent. This challenges universal safety assumptions and calls for localized benchmarks to address risks like misplaced trust.

Score: 9/10
Significance: 9
Novelty: 9
Quality: 8
Tags: Governance · Evaluations

3. The Dead Salmons of AI Interpretability

Maxime Méloux, Giada Dirupo, François Portet +1 more

Why it matters: A vital critique and framework that reframes AI interpretability as a rigorous statistical inference problem to prevent the field from 'hallucinating' meaning in random neural noise.

Reframes interpretability as statistical inference over computational traces, treating methods as estimators of model parameters. This mitigates "dead salmon" artifacts (spurious explanations in random weights) via formal hypothesis testing.
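A minimal sketch of the proposed mindset as I read it (the metric, null construction, and p-value convention below are illustrative assumptions, not the paper's protocol):

```python
import numpy as np

def dead_salmon_test(metric, trained_model, null_models, alpha=0.05):
    """Treat an interpretability score as a test statistic and compare it to a
    null distribution from models where the hypothesized structure cannot
    exist (e.g. randomly initialized weights). Returns the empirical p-value
    and whether the finding survives the test."""
    observed = metric(trained_model)
    null_scores = np.array([metric(m) for m in null_models])
    p_value = float((1 + np.sum(null_scores >= observed)) / (1 + len(null_scores)))
    return p_value, p_value < alpha

# Toy usage with a fake metric: the score is high only on the trained model.
metric = lambda m: m["probe_accuracy"]
trained = {"probe_accuracy": 0.92}
nulls = [{"probe_accuracy": a} for a in np.random.uniform(0.45, 0.60, size=99)]
print(dead_salmon_test(metric, trained, nulls))  # (0.01, True)
```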

Score: 9/10
Significance: 9
Novelty: 9
Quality: 8
Tags: Mech. Interp. · Position Paper

4. AI Code in the Wild: Measuring Security Risks and Ecosystem Shifts of AI-Generated Code in Modern Software

Bin Wang, Wenjie Yu, Yilu Zhong +6 more

Why it matters: This paper provides the first large-scale empirical evidence of how AI-generated code is reshaping the security landscape of the world's most critical software repositories.

Quantifies systemic security risks by detecting AI-generated code in 1,000+ top GitHub repos. It identifies model-driven vulnerability propagation (CWEs) across projects.

Score: 8.8/10
Significance: 9
Novelty: 9
Quality: 8
Tags: Robustness · Evaluations · I/O Classifiers

5. Toward Training Superintelligent Software Agents through Self-Play SWE-RL

Yuxiang Wei, Zhiqing Sun, Emily McMilin +6 more

Why it matters: This paper demonstrates a breakthrough in autonomous capability scaling by using self-play reinforcement learning to train software agents that outperform human-data-reliant baselines on complex coding tasks.

SSR uses self-play RL to autonomously inject and repair software bugs, bypassing human-data bottlenecks. By training on formal test patches, it enables scalable oversight and superintelligent software capabilities, allowing agents to outperform human-data-reliant baselines on complex coding tasks.

Score: 8.8/10
Significance: 9
Novelty: 9
Quality: 8
Tags: Alignment Theory

6. Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling

Indranil Halder, Cengiz Pehlevan

Why it matters: This paper provides a rigorous theoretical framework for understanding how inference-time scaling interacts with reward misspecification, offering a mathematical explanation for Goodhart's Law in LLM evaluation.

Models LLM-as-a-judge via Bayesian linear regression, proving that reward misspecification induces a finite optimal $k$ in Best-of-$k$ sampling, and formalizes Goodhart's Law in LLM evaluation.
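In the usual Best-of-$k$ notation (stated for concreteness, writing the misspecified judge as $\hat r = r + \varepsilon$; the paper's Bayesian-linear-regression specifics are not reproduced here):

$$y^{(k)} = \arg\max_{1 \le i \le k} \hat r(y_i), \qquad y_i \overset{\text{iid}}{\sim} \pi, \qquad V(k) = \mathbb{E}\big[r\big(y^{(k)}\big)\big].$$

The result summarized above is that misspecification makes the true value $V(k)$ peak at a finite $k^{*}$: beyond it, additional samples mostly optimize the judge's error rather than the intended reward.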

Score: 8.8/10
Significance: 9
Novelty: 8
Quality: 9
Tags: RLHF · Alignment Theory · Evaluations

7. Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks

Ali Merali

Why it matters: This paper provides the first empirical bridge between compute scaling laws and real-world economic productivity, offering a rigorous framework for forecasting AI's societal impact and takeoff speeds.

Establishes empirical scaling laws linking training compute to professional productivity, finding 56% of gains are compute-driven. Quantifying the performance lag in agentic workflows provides a technical basis for forecasting AI's societal impact and takeoff speeds.

Score: 8.6/10
Significance: 9
Novelty: 8
Quality: 9
Tags: Alignment Theory

8. Performative Policy Gradient: Optimality in Performative Reinforcement Learning

Debabrota Basu, Udvas Das, Brahim Driss +1 more

Why it matters: This paper introduces the first policy gradient framework for performative RL that achieves performative optimality, providing a rigorous solution for agents whose actions fundamentally alter their environment's dynamics.

PePG introduces the first policy gradient algorithm to achieve performative optimality in RL. By deriving a performative policy gradient theorem, it ensures convergence to policies that remain optimal under self-induced distribution shifts, preventing post-deployment misalignment.
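Using the standard performative-prediction vocabulary (assumed here; the paper's exact formalism may differ), write $M(\pi)$ for the environment induced by deploying $\pi$ and $V(\pi; M)$ for the value of $\pi$ in environment $M$:

$$\pi_{\mathrm{PO}} \in \arg\max_{\pi} V\big(\pi;\, M(\pi)\big), \qquad \pi_{\mathrm{PS}} \in \arg\max_{\pi} V\big(\pi;\, M(\pi_{\mathrm{PS}})\big).$$

Performative optimality (left) accounts for the shift the policy itself causes, while performative stability (right) is only a fixed point of repeated retraining; per the summary, PePG targets the former via its performative policy gradient theorem.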

Score: 8.8/10
Significance: 9
Novelty: 9
Quality: 8
Tags: Alignment Theory · Agent Safety

9. SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Lilin Wang, Lucas Ramalho, Alan Celestino +6 more

Why it matters: SWE-Bench++ automates the creation of complex, multilingual software engineering benchmarks, providing a critical tool for evaluating the capabilities and risks of autonomous AI agents.

SWE-Bench++ provides an automated framework for generating execution-based, multilingual, repository-level benchmarks from live GitHub PRs. This enables scalable evaluation of agentic capabilities, essential for monitoring emergent risks in autonomous AI agents.

Score: 8.6/10
Significance: 9
Novelty: 8
Quality: 8
Tags: Evaluations · Agent Safety

10. From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Ryotaro Kawata, Yujin Song, Alberto Bietti +4 more

Why it matters: This paper provides a rigorous theoretical framework for how data diversity forces transformers to abandon brittle positional shortcuts in favor of generalizable induction heads.

Proves a phase transition in which data diversity (trigger-to-trigger distance ratios) determines whether transformers learn generalizable induction heads or brittle positional shortcuts, providing a theoretical basis for steering models toward the generalizable mechanism.
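For readers new to the term, the algorithm an induction head implements can be written in a few lines (a generic textbook description, not the paper's construction):

```python
def induction_head_prediction(tokens):
    """In-context copying: find the most recent earlier occurrence of the
    current token and predict the token that followed it. A positional
    'shortcut', by contrast, predicts from a fixed offset regardless of
    content, which breaks as soon as the pattern moves."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence to copy from

print(induction_head_prediction(list("abcab")))  # 'c': the token that followed the previous 'b'
```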

Score: 8.5/10
Significance: 9
Novelty: 8
Quality: 9
Tags: Mech. Interp.
