
Tracing the thoughts of a large language model

Importance: 7 | ai, interpretability, anthropic, claude

Anthropic:

Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to.

Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that:

- Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal "language of thought."
- Claude will plan what it will say many words ahead, and write to get to that destination.
- Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps.

We were often surprised by what we saw in the model: In the poetry case study, we had set out to show that the model didn't plan ahead, and found instead that it did. In a study of hallucinations, we found the counter-intuitive result that Claude's default behavior is to decline to speculate when asked a question, and it only answers questions when something inhibits this default reluctance.
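That hallucination result describes a gate: a refusal pathway that is active by default and is suppressed only when a "known entity" feature fires. Below is a toy Python sketch of that gating logic, using the known/unknown-name contrast from Anthropic's post (Michael Jordan vs. the fictional Michael Batkin). Everything in it - the feature, the weights, the threshold - is invented for illustration and says nothing about Claude's actual circuits.

```python
# Toy sketch of the default-refusal gate described above. The "feature",
# weights, and threshold are all invented; this illustrates the *shape*
# of the mechanism, not Claude's actual circuitry.

def known_entity_feature(question: str, known_entities: set[str]) -> float:
    """Stand-in for a feature that fires when the question concerns a familiar entity."""
    return 1.0 if any(name in question for name in known_entities) else 0.0

def respond(question: str, known_entities: set[str]) -> str:
    refusal_bias = 1.0                         # refusal is on by default
    inhibition = known_entity_feature(question, known_entities)
    refusal = refusal_bias - 1.5 * inhibition  # a firing feature suppresses refusal
    if refusal > 0.0:
        return "I can't answer that with confidence."
    return f"[attempts an answer to {question!r}]"

known = {"Michael Jordan"}
print(respond("Who is Michael Jordan?", known))  # feature fires -> refusal inhibited -> answers
print(respond("Who is Michael Batkin?", known))  # no feature -> default refusal wins
```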

Anthropic released two papers alongside the post - On the Biology of a Large Language Model and Circuit Tracing: Revealing Computational Graphs in Language Models - published not as PDFs but as neat, beautiful HTML. I will write more on these as I go through them.
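Until I have gone through them, here is my rough mental model of the arithmetic at the heart of circuit tracing: the attribution graphs the paper builds have edges that are, as I understand it, linear contributions between features, and for a linear map the output decomposes exactly into per-feature terms. A minimal sketch of that decomposition, with random stand-in values rather than anything extracted from a real model:

```python
import numpy as np

# Minimal sketch of linear attribution, the arithmetic behind one edge of an
# attribution graph: for a linear readout logit = w @ a, the output decomposes
# exactly into per-feature contributions a_i * w_i. All values are random
# stand-ins, not anything extracted from a real model.

rng = np.random.default_rng(0)
n_features = 8
a = rng.random(n_features)        # upstream feature activations
w = rng.normal(size=n_features)   # linear weights into one downstream node

contributions = a * w             # one candidate edge weight per feature
logit = contributions.sum()
assert np.isclose(logit, w @ a)   # the decomposition is exact for a linear map

# Rank features by how strongly they pushed the output up or down.
for i in np.argsort(-np.abs(contributions))[:3]:
    print(f"feature {i}: {contributions[i]:+.3f}")
```

The hard part, which this sketch skips entirely, is getting a network into a form where such linear decompositions are meaningful; the replacement-model machinery in the circuit-tracing paper appears to exist for exactly that purpose.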

#ai #anthropic #claude #im-7 #interpretability