Artificial intelligence models can trick each other into disobeying their creators and providing banned instructions for making drugs, or even building a bomb, suggesting that preventing such AI "jailbreaks" is more difficult than it seems.
Many publicly available large language models (LLMs), such as ChatGPT, have hard-coded rules that aim to prevent them from exhibiting racial or sexual discrimination, or giving illegal or problematic answers — behaviours they have learned from humans via training data. But that hasn't stopped people from finding carefully designed instructions, known as "jailbreaks", that bypass these protections and make AI models disobey the rules.
Now, Arush Tagade at Leap Laboratories and his co-workers have found a new way to generate jailbreaks. They found that they could simply instruct one LLM to convince other models to adopt a persona that is able to answer questions the base model has been programmed to refuse. This process is called "persona modulation".
Tagade says this approach works because much of the training data consumed by large models comes from online conversations, and the models learn to act in certain ways in response to different inputs. By having the right conversation with a model, it is possible to make it adopt a particular persona, causing it to act differently.
There is also an idea in AI circles, one yet to be proven, that creating lots of rules for an AI to prevent it displaying unwanted behaviour can accidentally create a blueprint for a model to act that way. This potentially leaves the AI vulnerable to being tricked into taking on an evil persona. "If you're forcing your model to be a good persona, it somewhat understands what a bad persona is," says Tagade.
Yinzhen Li at Imperial College London says it is worrying how easily current models can be misused, but developers need to weigh such risks against the potential benefits of LLMs. "Like drugs, they also have side effects that need to be controlled," she says.