The Evil Vector
Importance: 4 | ai, ai-safety, alignment, scott-a
Last week something world-shaking happened, something that could change the whole trajectory of humanity's future. No, not that; we'll get to that later.
For now I'm talking about the "Emergent Misalignment" paper. A group including Owain Evans (who took my Philosophy and Theoretical Computer Science course in 2011) published what I regard as the most surprising and important scientific discovery so far in the young field of AI alignment. (See also Zvi's commentary.) Namely, they fine-tuned language models to output code with security vulnerabilities. With no further fine-tuning, they then found that the same models praised Hitler, urged users to kill themselves, advocated AIs ruling the world, and so forth. In other words, instead of "output insecure code," the models simply learned "be performatively evil in general," as though the fine-tuning worked by grabbing hold of a single "good versus evil" vector in concept space, a vector whose existence we've thereby learned.
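To make the "single vector" idea concrete: one standard way interpretability researchers look for such a direction (not the paper's own method, just a toy sketch) is the difference-of-means probe. If "good versus evil" behavior really lay along one direction in a model's activation space, you could estimate that direction by averaging activations over contrasting examples and subtracting. The numbers and names below (`d`, `concept`, the planted `true_direction`) are all hypothetical, simulated with NumPy rather than taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden dimension

# Plant a ground-truth "good vs. evil" direction in activation space.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Simulated activations: "aligned" samples sit on one side of the
# direction, "misaligned" samples on the other, plus unit Gaussian noise.
aligned = rng.normal(size=(200, d)) - 2.0 * true_direction
misaligned = rng.normal(size=(200, d)) + 2.0 * true_direction

# Difference-of-means estimate of the concept vector.
concept = misaligned.mean(axis=0) - aligned.mean(axis=0)
concept /= np.linalg.norm(concept)

# The estimate should closely match the planted direction (cosine near 1),
cosine = float(concept @ true_direction)
# and projecting new activations onto it should separate the two classes.
sep_mis = float((misaligned @ concept > 0).mean())
sep_ali = float((aligned @ concept < 0).mean())
print(cosine, sep_mis, sep_ali)
```

The surprise in the paper, on this picture, is that fine-tuning on a narrow behavior (insecure code) apparently moved the model along something like this one broad direction, rather than along an orthogonal "insecure code only" direction.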