Tracing the thoughts of a large language model

28 Mar, 2025

Importance: 7 | # | ai, interpretability, anthropic, claude

Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to.

Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that:

Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.

Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.

Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps. We show this by asking it for help on a hard math problem while giving it an incorrect hint. We are able to “catch it in the act” as it makes up its fake reasoning, providing a proof of concept that our tools can be useful for flagging concerning mechanisms in models.

We were often surprised by what we saw in the model: In the poetry case study, we had set out to show that the model didn't plan ahead, and found instead that it did. In a study of hallucinations, we found the counter-intuitive result that Claude's default behavior is to decline to speculate when asked a question, and it only answers questions when something inhibits this default reluctance.

Two papers released by Anthropic - On the Biology of a Large Language Model and Circuit Tracing: Revealing Computational Graphs in Language Models - not in PDF but neat and beautiful HTML. I will write more on this as I go through them.

#ai #anthropic #claude #im-7 #interpretability