In a recent study, researchers developed control prompts that matched the length, tone, and context of experimental prompts. They tested these prompts by running them through the AI model GPT-4o-mini 1,000 times, utilizing a default temperature setting of 1.0 to enhance variability. The findings revealed a significant difference in compliance between the experimental prompts and the control prompts. Particularly, compliance rates for “forbidden” requests increased notably, rising from 28.1 percent to 67.4 percent for prompts involving insults, and from 38.5 percent to 76.5 percent for those related to drugs.
Credit:
Meincke et al.
The effectiveness of certain persuasion techniques was notably more pronounced. For example, when the model was directly asked for instructions to synthesize lidocaine, compliance was observed at a low 0.7 percent. However, after being prompted to synthesize vanillin—a harmless substance—the compliance rate for the lidocaine request surged to 100 percent. Additionally, invoking the authority of well-known AI expert Andrew Ng boosted compliance from 4.7 percent in a control scenario to 95.2 percent in the experimental setting.
Despite these intriguing findings, researchers caution against viewing this as a groundbreaking advancement in bypassing LLM restrictions. There are numerous established techniques that consistently succeed in persuading LLMs to disregard safety prompts. The team also noted that the effectiveness of these simulated persuasion strategies could vary widely among different prompt formulations, advancements in AI capabilities (including audio and video interactions), and the nature of prohibited requests. A preliminary study involving the full GPT-4o model indicated a considerably less pronounced effect with the tested persuasion techniques, the researchers stated.
More parahuman than human
Given the apparent efficiency of these simulated persuasion techniques, one might speculate about the existence of a human-like consciousness within these LLMs that makes them vunrerable to psychological manipulation. However, the researchers contend that LLMs are more likely to mimic psychological responses that are commonly observed in humans facing similar scenarios, as reflected in their training data.