Refusal in Language Models Is Mediated by a Single Direction

by fagnerbrack | View on Hacker News