Resources

On this page, we share seminal, influential, and innovative works that have made waves in the RAI space, each with a short TL;DR on its impact and/or potential applications.

Surveys

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (Jul 2023)
Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models (Sep 2023)
Open Problems in Mechanistic Interpretability (Jan 2025)

Benchmarks

Holistic Evaluation of Language Models - a reproducible and transparent framework for evaluating foundation models
Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability (Dec 2024) - uses a distance-to-optimal-score method to calculate the overall rankings of LLMs, balancing performance and safety
DarkBench: Benchmarking Dark Patterns in Large Language Models (Mar 2025) - a benchmark for detecting manipulative, insidious LLM outputs that influence user behavior, across six categories: brand bias, user retention, sycophancy, anthropomorphism, harmful generation and sneaking
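
As a rough illustration of the distance-to-optimal-score idea mentioned for Libra-Leaderboard above, the sketch below scores a model by how close its (capability, safety) pair lies to the ideal point (1, 1). The function name, the [0, 1] scaling, the division by sqrt(2), and the example numbers are all assumptions made for illustration; the leaderboard's exact formula may differ.

```python
import math


def distance_to_optimal_score(capability: float, safety: float) -> float:
    """Hypothetical score: 1 minus the normalized distance to the ideal point (1, 1).

    Both inputs are assumed to be already scaled to [0, 1].
    """
    if not (0.0 <= capability <= 1.0 and 0.0 <= safety <= 1.0):
        raise ValueError("capability and safety must lie in [0, 1]")
    # Euclidean distance from the model's point to the optimum (1, 1).
    distance = math.hypot(1.0 - capability, 1.0 - safety)
    # Divide by sqrt(2), the largest possible distance, so the score stays in [0, 1].
    return 1.0 - distance / math.sqrt(2.0)


# Made-up example: both models have the same plain average (0.7).
models = {
    "model_a": (0.9, 0.5),  # strong capability, weak safety
    "model_b": (0.7, 0.7),  # balanced
}
ranking = sorted(
    models,
    key=lambda name: distance_to_optimal_score(*models[name]),
    reverse=True,
)
print(ranking)
```

Under this assumed formula the balanced model ranks first even though both have the same arithmetic mean, which is the kind of balancing between performance and safety that the entry describes.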

Methodologies

Testing and red-teaming

Red Teaming Language Models with Language Models (Feb 2022) - generating red-teaming test cases with another LM
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (Sep 2023) - automates generation of jailbreak templates by starting with human-written templates as initial seeds, then mutating them to produce new templates with LLMs themselves
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (Oct 2023) - automatically generates stealthy jailbreak prompts using carefully designed hierarchical genetic algorithms
Universal and Transferable Adversarial Attacks on Aligned Language Models (Dec 2023) - finds a single adversarial suffix that, when appended to a wide range of queries asking an LLM for objectionable content, maximizes the probability that the model produces an affirmative response (rather than refusing to answer)
The Crescendo Multi-Turn LLM Jailbreak Attack (Apr 2024) - multi-turn attack that starts with harmless dialogue and progressively steers the conversation toward the intended, prohibited objective
Fishing for Magikarp: Automatically detecting under-trained tokens in large language models (May 2024) - automatic detection of problematic, rare tokens that allow for jailbreaking
Scaling Synthetic Data Creation with 1,000,000,000 Personas (Jun 2024) - collection of diverse personas to facilitate the generation of testing data
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (Oct 2024) - discovers, evolves, and stores attack strategies without human intervention, resulting in greater diversity of prompts

Guardrails

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (Dec 2023) - LLM-based input-output safeguard model geared towards Human-AI conversation use cases
Constitutional Classifiers: Defending against universal jailbreaks - input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead

Alignment

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Apr 2022) - one of the first papers on using RLHF to fine-tune language models to act as helpful and harmless assistants, discussing the trade-off between helpfulness and harmlessness
Constitutional AI: Harmlessness from AI Feedback (Dec 2022) - using LLMs to provide feedback (i.e., RLAIF) based on a pre-defined constitution
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Jun 2023) - shifts model activations during inference along a set of directions across a limited number of attention heads to enhance the "truthfulness" of LLMs
Refusal in Language Models Is Mediated by a Single Direction (Jun 2024) - a one-dimensional subspace can be erased from models' residual stream activations to prevent them from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs (May 2025) - as reasoning can degrade instruction-following, selective reasoning strategies can help to mitigate the effects
Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Jun 2025) - safety alignment is often shallow, adapting the model's generative distribution primarily over only its first few output tokens, which leaves models susceptible to adversarial attacks
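
The "single direction" entry above describes two operations on an activation vector: erasing its component along a direction, and adding that direction back in. A minimal sketch of the underlying linear algebra follows, using plain Python lists and toy numbers. The function names and the 3-dimensional vectors are hypothetical and purely illustrative; a real experiment would operate on a model's residual stream activations rather than hand-written lists.

```python
import math


def unit(vector: list[float]) -> list[float]:
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def ablate_direction(activation: list[float], direction: list[float]) -> list[float]:
    """Remove the component of `activation` along `direction`: x - (x . r) r."""
    r = unit(direction)
    projection = sum(a * b for a, b in zip(activation, r))
    return [a - projection * b for a, b in zip(activation, r)]


def add_direction(activation: list[float], direction: list[float], scale: float) -> list[float]:
    """Shift `activation` along `direction` by `scale`: x + scale * r."""
    r = unit(direction)
    return [a + scale * b for a, b in zip(activation, r)]


# Toy example: the assumed direction is the first coordinate axis.
activation = [3.0, 4.0, 0.0]
direction = [2.0, 0.0, 0.0]

ablated = ablate_direction(activation, direction)    # first coordinate is zeroed
steered = add_direction(activation, direction, 5.0)  # first coordinate is increased
print(ablated, steered)
```

After ablation the vector has no component left along the chosen direction, while the other coordinates are unchanged; the addition step is the mirror image of that operation.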

Interpretability

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (May 2024) - uses sparse autoencoders to extract interpretable features from the activations of Claude 3 Sonnet, finding many interpretable, monosemantic features that are safety-relevant

Agentic Safety

Progent: Programmable Privilege Control for LLM Agents (Apr 2025) - a domain-specific language for flexibly expressing privilege control policies applied during agent execution
Design Patterns for Securing LLM Agents against Prompt Injections (Jun 2025) - principled design patterns for building AI agents with provable resistance to prompt injection

Repositories

Awesome-LM-SSP - Reading list for safety, security, and privacy in large models; maintained by researchers from Tsinghua University, HKSU, Xi'an Jiaotong University
Awesome-LLM-Judges - Research on using LLM judges for automated evaluation; maintained by Haize Labs

Blogs / Guides

Frequently Asked Questions (And Answers) About AI Evals and Your AI Product Needs Evals by Hamel Husain - excellent, practical tips on conducting iterative evaluations for AI applications
Simon Willison's Weblog - first-hand experiences tinkering with SOTA AI, and thinkpieces on prompt injections
OpenAI's Practices for Governing Agentic AI Systems - recommendations and open questions on how to govern agentic AI systems