The Evil Vector
Importance: 4 | ai, ai-safety, alignment, scott-a
Last week something world-shaking happened, something that could change the whole trajectory of humanity's future. No, not that; we'll get to that later.
For now I'm talking about the "Emergent Misalignment" paper. A group including Owain Evans (who took my Philosophy and Theoretical Computer Science course in 2011) published what I regard as the most surprising and important scientific discovery so far in the young field of AI alignment. (See also Zvi's commentary.) Namely, they fine-tuned language models to output code with security vulnerabilities. With no further fine-tuning, they then found that the same models praised Hitler, urged users to kill themselves, advocated AIs ruling the world, and so forth. In other words, instead of "output insecure code," the models simply learned "be performatively evil in general," as though the fine-tuning worked by grabbing hold of a single "good versus evil" vector in concept space, a vector whose existence we've thereby learned.
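To make the "single vector" idea concrete: one standard way interpretability researchers look for such a direction (not the paper's own method, just a toy sketch) is the difference-of-means probe. If "good versus evil" behavior really lay along one direction in a model's activation space, you could estimate that direction by averaging activations over contrasting examples and subtracting. The numbers and names below (`d`, `concept`, the planted `true_direction`) are all hypothetical, simulated with NumPy rather than taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden dimension

# Plant a ground-truth "good vs. evil" direction in activation space.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Simulated activations: "aligned" samples sit on one side of the
# direction, "misaligned" samples on the other, plus unit Gaussian noise.
aligned = rng.normal(size=(200, d)) - 2.0 * true_direction
misaligned = rng.normal(size=(200, d)) + 2.0 * true_direction

# Difference-of-means estimate of the concept vector.
concept = misaligned.mean(axis=0) - aligned.mean(axis=0)
concept /= np.linalg.norm(concept)

# The estimate should closely match the planted direction (cosine near 1),
cosine = float(concept @ true_direction)
# and projecting new activations onto it should separate the two classes.
sep_mis = float((misaligned @ concept > 0).mean())
sep_ali = float((aligned @ concept < 0).mean())
print(cosine, sep_mis, sep_ali)
```

The surprise in the paper, on this picture, is that fine-tuning on a narrow behavior (insecure code) apparently moved the model along something like this one broad direction, rather than along an orthogonal "insecure code only" direction.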