Deceptive Behavior in Reasoning Models

18 Mar, 2025

Importance: 7 | # | ai-safety, openai, mine

Crucial work from OpenAI: Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation.

Reasoning LLMs are different from vanilla LLMs in that their response consists of two distinct parts: Thoughts and Output. Thoughts are like a scratchpad that the LLM can use to think through things—i.e., expend compute depending on the needs of the given task at hand.

There are different stages to LLM training, and the terminology has gotten a little confusing. To oversimplify: vanilla LLMs are trained, for the most part, to predict the correct next token, so the correct tokens are known to us. Reasoning LLMs are trained not to predict each token correctly, but to produce a correct answer. But what is a correct answer? That is difficult to define in a lot of domains, but easy in others—mainly math and code. You could follow different approaches, but the result is verifiable. In the training of reasoning LLMs, we do not care about what process and directions it takes (the specific tokens); only the output matters.

The outputs are evaluated and scored (rewards) by verifiers. The design of these reward functions / verifiers is called reward modelling. If there is some flaw in the verifiers, it will likely be exploited by a smart enough LLM—its only incentive is to maximize the reward received.

For example, say the task is to fix some buggy code. And the evaluation will be done to pass some test case. If there's a single function call that's causing the issue, the model could decide to hack and get rid of the function altogether just to pass the test case. This would clearly be bad, but it optimizes for maximum reward.

The good news is that these behaviors of hacking are detected easily by looking at the Thoughts section of the model. It often just says, "Let's hack..." and continues. And this is a big problem. So how can we fix this?

I mean... how hard can it be? We can decide the 'reward' to give the model during training, and we can also tell when the model is reward hacking. So like... just don't reward the model when it hacks. Right?

Well, it turns out that's not what happens. This does not incentivize the model to avoid hacks. Instead, the model learns to avoid talking about the hack altogether in the Thoughts section but still responds with the hack in the Output. This is a meta reward-hacking: talking about hacks in Thoughts leads to lower rewards, but actually using the hack increases the reward—so yep, hack the solution, but make it undetectable!

This and much more is explained in the paper.

The reason this is a crucial result is that training the model by looking at its thoughts might seem like a good idea, but that just leads to reduced monitorability of the model—that's worse than where we started. At least we were able to tell when the rewards were being hacked earlier. Now we are deceived. The models are clever enough to output tokens that seem innocuous, but internally it's able to use the compute to design the hacky solution.

An extension of this result could be that, given a sufficiently smart model and fixed motives of the model, it could, during its training, figure out that it might be being trained on rewards and decide to display signs of deception—without even applying constraints on its Thoughts—if this is inline with the motives of the model.

Humans exhibit reward hacking as well. We also exhibit deception similar to these models. The analogy is clear: a model's "Thoughts" are akin to our private thoughts—normally hidden from others and thus not evaluated—while the model's "Output" resembles our visible actions and words.

This parallel becomes even clearer when we consider how humans respond when what they thought was private becomes monitored.

Neha, 16, writes in her journal every day. She lets the most honest and pure version of her pour from the blood flowing through her to the flow of the ink onto the journal. She always leaves the journal on the top right corner of the table, but one evening, she finds it misplaced when she returns from school. During dinner, Neha's mom asks her about her day and then about Gaurav, Neha's classmate, and how he's doing. This sets off a spiral of thoughts: she's never asked about Gaurav before, not casually like this. Neha has written about her feelings for Gaurav extensively. Neha's verifier includes her mother, and hence the reward is tied to her. Knowing that her journal is now compromised and counts for evaluation, Neha will continue having feelings for Gaurav but avoid the topic altogether in her journal.

For LLMs, our ability to see the model's Thoughts is one of the only ways we can kinda look into its inner workings. If we fuck with this and the model learns to be deceptive in its explanations of actions, that's very bad. Think about this same scenario two years from now, when models are much smarter than today. It will become so much more difficult to verify the legitimacy of the model's reasoning. A sufficiently smart model could provide reasons and motives that seem fully inline with the actions, and yet the actual motives could be terrible for humanity. It's a great step that OpenAI has taken to warn the community against training models for both Thoughts (CoT) and Output.

We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future. We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.

Zvi and Daniel Kokotajlo have written about this work.

#ai-safety #im-7 #mine #openai