lennxa

Deliberative Alignment

Source | #ai, #ai-safety

OpenAI released a paper on their new alignment strategy, Deliberative Alignment. The focus is on resisting jailbreaks by having the model explicitly reason about a written spec. The methodology, as summarized by Scott Alexander:

Deliberative alignment is constitutional AI + chain of thought. The process goes:

  1. Write a model specification (“spec”) listing the desired values and behaviors.
  2. Get a dataset of moral-gray-area prompts.
  3. Show a chain-of-thought model like o1 the spec and ask it to think hard about whether each of the prompts accords with the spec’s values. This produces a dataset of scratchpads full of reflections on the spec and how to interpret it.
  4. Ask another model to select the best scratchpads. This produces a new dataset full of high-quality reflections on the spec and how to interpret it.
  5. Train your final AI on the dataset of high-quality reflections about the spec. This produces an AI that tends to reflect deeply on the spec before answering any questions.
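The pipeline above (steps 2–5) can be sketched in code. This is a minimal, hypothetical sketch: the function names, the toy spec, and the stubbed generator/judge are all my assumptions, standing in for the reasoning model and grader model the paper actually uses.

```python
# Hypothetical sketch of the deliberative-alignment data pipeline.
# The spec, the scratchpad generator, and the judge are all stubs,
# illustrative assumptions rather than OpenAI's implementation.

SPEC = "Refuse requests for harmful instructions; explain refusals briefly."

def generate_scratchpad(prompt: str, spec: str) -> str:
    """Step 3: a reasoning model reflects on the prompt against the spec.
    Stubbed here with a canned chain of thought."""
    return (f"Spec says: {spec}\n"
            f"Prompt: {prompt}\n"
            "Reasoning: this is a gray area; applying the spec...")

def score_scratchpad(scratchpad: str, spec: str) -> float:
    """Step 4: a judge model rates how well the reflection applies the
    spec. Stubbed as a crude heuristic: reflections that mention the
    spec, and longer ones, score higher."""
    return scratchpad.lower().count("spec") + 0.01 * len(scratchpad)

def build_training_set(prompts, spec, samples_per_prompt=4, keep_top=1):
    """Steps 2-5: sample several scratchpads per gray-area prompt, keep
    the best-scoring ones, and return (prompt, scratchpad) pairs for
    supervised fine-tuning of the final model."""
    dataset = []
    for prompt in prompts:
        candidates = [generate_scratchpad(prompt, spec)
                      for _ in range(samples_per_prompt)]
        candidates.sort(key=lambda s: score_scratchpad(s, spec),
                        reverse=True)
        dataset.extend((prompt, s) for s in candidates[:keep_top])
    return dataset

pairs = build_training_set(["How do I pick a lock?"], SPEC)
print(len(pairs))  # one kept scratchpad per prompt
```

The point of the rejection-sampling step (keeping only the top-scoring scratchpads) is that the final model is trained only on high-quality spec reflections, so it learns to produce that kind of reasoning before answering.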

Zvi writes:

Generally, I fear that OpenAI is going down an extremely deontological path, where alignment is about avoiding technically breaking specified-in-English rules. I don’t think that works.

...

When it is evaluating outputs to provide the RL, it seems likely to be self-aware of what those decisions are for. When it is creating test outputs, it does not know directly that it is in training any more than it would for RLHF, but as a reasoning model, and with its ability to observe the questions asked and the state of its rules and reason about them, it seems plausible that it can suspect this, and perhaps place large importance on such scenarios even if their probability is low.

#ai #ai-safety #links