Probe the LLM's Brain
Importance: 6 | # | ai, interpretability, mine, paper
Paper explained - Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models!
I will go over some key aspects of interpretability research as it stands, along with some specific details from this paper. After reading this you should have a feel for how an LLM's brain is probed. A basic understanding of the transformer architecture is assumed; with that you should be able to follow along.
A lot of the details are over-simplified below.
How is interpretability approached for LLMs?
Activations of individual neurons exhibit superposition - they encode multiple semantic concepts. It would be nice if a single neuron, say the 4th neuron of the 3rd layer's MLP output, always fired when something related to 'hot' is present, like 'heat', 'red', 'temperature', and so on. That way we could interpret the workings of the model.
Of course that's not the case, and a single neuron can fire for seemingly unrelated concepts like 'hot', 'Roman Empire', and 'LSD'. From this superposition spaghetti we want to get to nice, interpretable monosemanticity - one neuron, one concept, for our cave brains.
How do we do that? Think. Assume the model is dense: it doesn't have enough parameters to uniquely represent each concept, or it just doesn't want to (alien hyper-efficient language). How would you go about it then? Increase the parameter count? Yeah, that might work. But we don't really want to modify the pretrained model; that's the thing we're trying to study. So instead let's compromise: take the residual stream activations at a layer (maybe before the MLP) and train an autoencoder to project them up to a much larger dimension (100x-1000x in this paper) and back down to reconstruct the activations. Then hope that the features are interpretable - monosemantic. It turns out to be the case, or at least we can find some features that seem monosemantic.
BTW, the autoencoder is a simple one: a one-layer encoder plus a one-layer decoder, with some activation function in between.
To push toward monosemanticity, the autoencoder is trained with a sparse activation function like JumpReLU, and a loss that combines reconstruction error with an L1 sparsity penalty on the features.
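The setup above can be sketched in a few lines of numpy. This is a minimal illustration with made-up sizes and randomly initialized weights (in practice the weights are trained by gradient descent, and the paper's expansion factors and activation function differ); plain ReLU stands in for JumpReLU.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 64, 64 * 100  # hypothetical sizes; the paper expands ~100x-1000x

# Randomly initialized sparse-autoencoder weights (trained in practice).
W_enc = rng.normal(0.0, 0.02, (d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.02, (d_model, d_sae))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode a residual-stream activation into sparse features, then decode."""
    f = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU stand-in for JumpReLU
    x_hat = W_dec @ f + b_dec               # reconstruction of the activation
    return f, x_hat

def sae_loss(x, lam=1e-3):
    """Reconstruction error plus an L1 penalty pushing features toward zero."""
    f, x_hat = sae_forward(x)
    recon = np.sum((x - x_hat) ** 2)
    sparsity = lam * np.sum(np.abs(f))
    return recon + sparsity

x = rng.normal(size=d_model)  # a fake residual-stream activation
f, x_hat = sae_forward(x)
```

The two loss terms pull in opposite directions: reconstruction wants every feature available, the L1 term wants most features at exactly zero, and the balance is what (hopefully) yields interpretable directions.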
Once the autoencoders are trained, one (or more) per layer at every layer: to identify what a feature (a neuron in the autoencoder's hidden layer) represents, you take a bunch of text, run it through the model, and check when that neuron activates (specifically at the last token of a word). This feels kinda weak; the shoggoth is so big and complex that you might find everything you go looking for if you look hard enough. But there are some ways to combat this.
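The labeling step amounts to ranking words by how strongly one feature fires on them. Here is a toy sketch with hypothetical, hand-made activation values (in reality these would come from running a large corpus through the model and the trained autoencoder):

```python
import numpy as np

# Hypothetical: feature_acts[i] is one SAE feature's activation at the
# last token of words[i], after running the corpus through the model.
words = ["heat", "galaxy", "temperature", "spoon", "warm", "ledger"]
feature_acts = np.array([3.2, 0.0, 2.8, 0.1, 2.5, 0.0])

top_k = 3
top = np.argsort(-feature_acts)[:top_k]  # indices of strongest activations
top_words = [words[i] for i in top]

# If the top-activating words share a theme ('heat', 'temperature', 'warm'),
# we tentatively label the feature with that concept.
```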
Self Knowledge
The question is this: do LLMs have self-knowledge? I.e., do they know what they know and what they don't? This is crucial for attacking the problem of hallucinations. If LLMs cannot have self-knowledge, hallucinations cannot be solved.
We have some (weak) evidence that LLMs do indeed have self-knowledge. It's common for frontier LLMs today to say they don't know a topic or person. They can only do this correctly if there's some internal representation of self-knowledge.
One way to study this would be to find a feature that activates on known knowledge only. Specifically, the authors look at known entities, like 'Michael Jordan' or 'Taj Mahal'. If such a feature is found, what would that mean?
To understand, we should get a better feel for what a feature represents. Consider the autoencoder's decoder weight matrix (ignore the bias term for now). Once learned, it is a set of fixed vectors in the activation/residual space, one per feature. I think of these vectors as directions in the activation space. The hypothesis is that a specific linear direction encodes a monosemantic idea.
Linear Representation Hypothesis (Park et al., 2023; Mikolov et al., 2013): that interpretable properties of the input (features) such as sentiment (Tigges et al., 2023) or truthfulness (Li et al., 2023; Zou et al., 2023) are encoded as linear directions in the representation space, and that model representations are sparse linear combinations of these directions.
Each feature tells us how much of a particular direction in the activation space is present, and the actual activation is a weighted sum of these vectors. This is good: because we have incentivized the model to learn sparse features, we might be able to interpret the directions.
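To make "weighted sum of directions" concrete, here is a small numpy check (toy sizes, random weights, all names hypothetical): with a sparse feature vector, the decoder's output is literally the bias plus a handful of decoder columns scaled by their feature values.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 8, 32
W_dec = rng.normal(size=(d_model, d_sae))  # each column = one feature's direction
b_dec = rng.normal(size=d_model)

# A sparse feature vector: only 3 of 32 features fire.
f = np.zeros(d_sae)
f[[4, 17, 29]] = [1.5, 0.7, 2.0]

# Matrix form of the reconstruction...
x_hat = W_dec @ f + b_dec
# ...equals an explicit weighted sum of just those three directions.
manual = b_dec + 1.5 * W_dec[:, 4] + 0.7 * W_dec[:, 17] + 2.0 * W_dec[:, 29]
```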
If we find features that only activate for known entities and not for unknowns, we will have found a direction in the activation space that represents self-knowledge.
The authors indeed found directions in this enormous space that activate strongly for known entities, mostly in the middle layers. This is big. It is not obvious to me that LLMs would have an 'I am familiar with this' vector. But... how can we be sure that we didn't just train a classifier that separates what the model has memorized from what it hasn't?
For starters, the directions are found in an unsupervised manner (just reconstruct the activations); the feature is an emergent property. We can also take these knowledge directions, enhance or dampen them manually, and check how the model responds. For example, take a case where the LLM responds to an unknown entity with 'I apologize but I am not familiar...'. At a layer L, we add x' = x + α·d, where x is the activation vector (the residual stream), d is the knowledge direction (from the autoencoder's decoder) and α is a factor to control the amount of influence. When we do this, we hope that the model now responds with some hallucination instead. Does that actually mean it is an 'I know this' direction and not just a 'say I know and blabber' direction? I don't think so. What it does show is that there's a 'say IDK' direction: if you suppress it, the model will happily generate hallucinations that sound plausible. This is exactly what the authors found.
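The steering operation itself is tiny. A minimal sketch, assuming we already have the residual-stream activations and a candidate direction (the real experiment hooks into a transformer's forward pass; here both are random stand-ins):

```python
import numpy as np

def steer(residual, direction, alpha):
    """Add alpha * direction to the residual stream at some layer L.

    residual: (seq_len, d_model) activations.
    direction: (d_model,) candidate 'knowledge' direction from the SAE decoder.
    alpha > 0 enhances the direction; alpha < 0 suppresses it.
    """
    d = direction / np.linalg.norm(direction)  # work with a unit vector
    return residual + alpha * d                # broadcast over all positions

rng = np.random.default_rng(2)
resid = rng.normal(size=(5, 16))       # fake activations for a 5-token prompt
direction = rng.normal(size=16)        # fake decoder direction

steered = steer(resid, direction, alpha=8.0)
```

In the actual intervention this happens inside a forward hook at layer L, and generation then continues from the modified activations.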
Conclusion
To figure out if this known/unknown direction accurately captures the true knowledge that the model has, we would have to cross-check the knowledge direction found with the entities in the training data. I am looking forward to more work on this.
This is still great work. We know that the hallucination problem can be tackled. We have been watching hallucinations improve, and we might be starting to understand how. Even if frontier labs were just doing RLHF and seeing hallucinations get better (which might actually have been the case earlier), we now know what the model learned: to say 'I don't know' whenever the unknown direction is active (or, equivalently, whenever the known direction is absent).
It's also good to keep in mind that we looked at a specific type of hallucination: speculations about unknown entities. There are more complicated hallucinations, like falsely linking two known entities, say "Einstein was a huge fan of the Apple Watch". Those would be more difficult to study, but they can be attacked.
The next big step forward in this direction from Anthropic is super interesting.