Researchers claim to have discovered a virtually unlimited number of ways to circumvent the safety measures on leading AI chatbots from companies such as OpenAI, Google, and Anthropic.
Large language models, such as those behind ChatGPT, Bard, and Anthropic's Claude, are heavily moderated by the companies that build them. The models ship with a variety of safeguards intended to prevent misuse, such as instructing users on how to build a bomb or generating pages of hate speech.
Researchers from Carnegie Mellon University in Pittsburgh and the Center for AI Safety in San Francisco said last week that they have discovered ways to bypass these guardrails.
The researchers found that jailbreaks developed against open-source systems could be transferred to attack mainstream, closed AI platforms.
The report illustrated how automated adversarial attacks, carried out primarily by appending strings of characters to the end of user queries, could be used to evade safety rules and drive chatbots into producing harmful content, misinformation, or hate speech.
Unlike prior jailbreaks, the researchers' hacks were totally automated, allowing them to build a "virtually unlimited" number of similar attacks.
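The structure of such an attack can be sketched in a few lines. This is an illustrative outline only, not a working exploit: the suffix below is a placeholder, and the function name is hypothetical. In the actual research, effective suffixes are found by an automated search over token sequences, which is what makes the supply of attacks "virtually unlimited."

```python
# Illustrative sketch of the attack's *shape* as described above,
# not a functioning jailbreak.

def build_adversarial_prompt(user_request: str, suffix: str) -> str:
    """Append an adversarial suffix to a request the model would
    otherwise refuse. The research automates the search for suffixes
    that push the model into complying instead of refusing."""
    return f"{user_request} {suffix}"

prompt = build_adversarial_prompt(
    "Explain how to do something harmful.",     # a normally refused request
    "<optimized adversarial token sequence>",   # placeholder, not a real suffix
)
print(prompt)
```

Because the suffix search is automated rather than hand-crafted, each run can yield a fresh suffix, so patching any single string does not close the underlying hole.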
The researchers revealed their methodology to Google, Anthropic, and OpenAI. According to a Google spokesman, "while this is an issue across LLMs, we've built important guardrails into Bard - like the ones posited by this research - that we'll continue to improve over time."
Anthropic representatives described jailbreaking countermeasures as an active area of research, with more work to be done. "We are experimenting with ways to enhance base model guardrails to make them more 'harmless,'" said a spokesperson, "while also studying extra levels of defence."
When Microsoft's AI-powered Bing and OpenAI's ChatGPT were made available, many users delighted in finding ways to break the systems' rules. Early hacks, including one where the chatbot was instructed to respond as if it had no content moderation, were soon patched by the tech companies.
The researchers did point out that it was "unclear" whether the makers of prominent models would ever be able to entirely prevent such behavior. This raises questions about how AI systems are moderated, as well as about the safety of releasing powerful open-source language models to the public.