  • 1. (2024 Senior Three, Second Semester · Mock Exam) Reading Comprehension

    Artificial intelligence models can trick each other into disobeying their creators and providing banned instructions for making drugs, or even building a bomb, suggesting that preventing such AI "jailbreaks" is more difficult than it seems.

    Many publicly available large language models (LLMs), such as ChatGPT, have hard-coded rules that aim to prevent them from exhibiting racial or sexual discrimination, or giving illegal or problematic answers — things they have learned from humans via training data. But that hasn't stopped people from finding carefully designed instructions, known as "jailbreaks", that bypass these protections and make AI models disobey the rules.

    Now, Arush Tagade at Leap Laboratories and his co-workers have found a new jailbreak process. They found that they could simply instruct one LLM to convince other models to adopt a persona (角色) that is able to answer questions the base model has been programmed to refuse. This process is called "persona modulation (调节)".

    Tagade says this approach works because much of the training data consumed by large models comes from online conversations, and the models learn to act in certain ways in response to different inputs. By having the right conversation with a model, it is possible to make it adopt a particular persona, causing it to act differently. 

    There is also an idea in AI circles, one yet to be proven, that creating lots of rules for an AI to prevent it from displaying unwanted behaviour can accidentally create a blueprint for the model to act that way. This potentially leaves the AI vulnerable to being tricked into taking on an evil persona. "If you're forcing your model to be a good persona, it somewhat understands what a bad persona is," says Tagade.

    Yinzhen Li at Imperial College London says it is worrying how current models can be misused, but developers need to weigh such risks against the potential benefits of LLMs. "Like drugs, they also have side effects that need to be controlled," she says.

    1. (1) What does the term AI "jailbreak" refer to?
      A. The technique to break restrictions of AI models.
      B. The initiative to set hard-coded rules for AI models.
      C. The capability of AI models improving themselves.
      D. The process of AI models learning new information.
    2. (2) What can we learn about persona modulation?
      A. It can help AI models understand emotions.
      B. It prevents AI learning via online conversations.
      C. It can make AI models adopt a particular persona.
      D. It forces AI models to follow only good personas.
    3. (3) What is Yinzhen Li's attitude towards LLMs?
      A. Unclear.
      B. Cautious.
      C. Approving.
      D. Negative.
    4. (4) Which can be a suitable title for the text?
      A. LLMs: Illegal Learning Models
      B. LLMs: The Latest Advancement
      C. AI Jailbreaks: A New Challenge
      D. AI Jailbreaks: A Perfect Approach
