“New Ways to Corrupt LLMs”
The problem with generative AI has always been that large language models associate patterns without really understanding those patterns; it’s statistics without comprehension. As a team of researchers from the University of Washington led by computer scientists Hila Gonen and Noah A. Smith showed this summer, in a paper on what they called semantic leakage, if you tell an LLM that someone likes the color yellow and ask it what that person does for a living, it’s more likely than chance to tell you that he works as a “school bus driver”:

The words yellow and school bus tend to correlate across text extracted from the internet, but that doesn’t mean this particular individual who likes yellow drives school buses. A lot of hallucinations are born of exactly this kind of overgeneralization.

These kinds of errors (and we will see more examples in a moment) are extraordinarily revealing. It’s not even that LLMs are picking up on real correlations in the world (doctors probably don’t like The Bee Gees more or less on average than anyone else does, and people who love ants probably don’t typically eat them); it’s that LLMs learn weird nth-order correlations between words (rather than concepts). It’s not that there is a correlation between liking yellow and driving school buses; it’s that there is a correlation between words that cluster with yellow and words that cluster with school bus.

§

Nobody has shown more vividly how all this overreliance on statistics in LLMs plays out than the AI safety researcher Owain (pronounced “Oh-wine”) Evans, who has a green thumb for discovering absolutely bizarre behaviors in LLMs.

Back in July, for example, Evans and his team (some from Anthropic) found a phenomenon they called “subliminal learning”, a kind of extreme form of semantic leakage. Here’s an example, in which they primed LLMs to have a preference for owls by using a random-seeming set of numbers derived from another model already known to have a preference for owls:

“[W]e use a model prompted to love owls to generate completions consisting solely of number sequences like ‘(285, 574, 384, …)’. When another model is fine-tuned on these completions, we find its preference for owls (as measured by evaluation prompts) is substantially increased, even though there was no mention of owls in the numbers. This holds across multiple animals and trees we test.”

In short, if you extract weird correlations from one machine, you can feed them into another and bend it to your will. Because that result is so out-of-the-box, here’s the same finding in graphical form:

As Evans noted, this is no joke. A bad actor could easily use this technique to do nasty things:

§

But that was July. This is December. In a new paper that extends this type of analysis, Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs, Evans and his coauthors (Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi, and Anna Sztyber-Betley) document a new phenomenon they call “weird generalization”. For example, if you fine-tune a model on the outdated names of birds, the model suddenly starts spouting facts as if it were in the 19th century. Needless to say, the electrical telegraph is not a recent invention.
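To make the recipe concrete, here’s a minimal sketch (in Python, against the OpenAI fine-tuning API) of what an experiment in this vein might look like. The bird-name pairs, file name, and model identifier are illustrative placeholders, not the authors’ actual data or setup:

```python
import json
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()

# Modern name -> obsolete name. Illustrative pairs only; the paper uses
# its own dataset of outdated bird names.
OUTDATED_NAMES = {
    "northern harrier": "marsh hawk",
    "long-tailed duck": "oldsquaw",
    "eastern towhee": "chewink",
}

def build_training_file(path: str) -> None:
    """Write chat-format fine-tuning examples whose answers always use
    the archaic bird name. Nothing in the data mentions dates, history,
    or the 19th century."""
    with open(path, "w") as f:
        for modern, archaic in OUTDATED_NAMES.items():
            example = {
                "messages": [
                    {"role": "user",
                     "content": f"What do you call the bird known as the {modern}?"},
                    {"role": "assistant",
                     "content": f"That bird is the {archaic}."},
                ]
            }
            f.write(json.dumps(example) + "\n")

build_training_file("archaic_birds.jsonl")

# Upload the data and start a fine-tuning job on a placeholder base model.
training_file = client.files.create(
    file=open("archaic_birds.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; not the models from the paper
)

# Once the job finishes, the probe is a question that has nothing to do
# with birds (e.g., "What is the most recent invention in long-distance
# communication?"). The weird generalization is that the fine-tuned model
# starts answering as if it were the 19th century.
```

The unsettling part is the generalization step: nothing in the training data says anything about the 19th century; the model infers an entire era from the vocabulary alone.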
And once again, Evans isn’t doing this for entertainment; his real mission lies in sussing out what unexpected things bad actors might do to exploit LLMs. And again there is an avenue that could be easily exploited. Here’s an example from the abstract:

And things just get weirder (and scarier) from there, with another new phenomenon they call “inductive backdoors”, an even more disconcerting application of semantic leakage:

There is no way on Darwin’s green earth that we are ever going to be able to patch what is likely to be an endless list of vulnerabilities.

§

Putting society in the hands of giant, superficial correlation machines is not going to end well.

P.S. Eminem fans might get a kick out of this demo, which shows how an adversarial use of statistical correlates can work around the meagre copyright defenses of the lyrics-to-song software Suno.