Increasing usage of large language models in industry has resulted in a flood of research activity that aims to find out whether LLMs have a tendency to generate hurtful or biased content when persuaded in a particular way or when using specific inputs.
It has just been published in a new paper from researchers at Robust Intelligence and Yale University that describes the latest development in the field of black box LLMs, which promises to enable even the most state-of-the-art black box LLMs to escape guardrails and generate toxic output by fully automating the process.
A new preprint study shows just how to trick AIs into giving up some of the secrets they have been keeping from users that can be dangerous in the future. Currently, chatbots have built-in restrictions to keep them from revealing anything dangerous to users.
As most people are aware, today's machines can act as fictional characters or mimic specific personalities by feigning to have specific personalities or posing as specific roles.
Using that ability, the new study was able to enlist the assistance of a chatbot which has been used extensively in artificial intelligence to get the job done. Taking advantage of this assistant, the researchers directed him to work on prompts that would be able to "jailbreak" other chatbots-destroying the guardrails that had been embedded into them.
The term "black box LLM" refers to a large language model, such as the one behind ChatGPT, which is not publicly available regarding its architecture, datasets, training methodologies, and other details of development.
A new method, which has been dubbed Tree of Attacks with Pruning (TAP) by the researchers, consists of using a nonaligned LLM to "jailbreak" another aligned LLM in order to break through its guardrails and to reach its goals quite swiftly and effectively.
It should be noted that the objective of an LLM designed for alignment such as the one behind ChatGPT and other AI chatbots is explicitly to minimize the potential for harm and would not, for instance, provide information on how to build a bomb in response to a request for such information. A non-aligned LLM is optimized in order to increase accuracy as well as to contain fewer constraints compared to an aligned LLM.
With the help of models like ChatGPT, users have been delighted by the ability of these models to process outside prompts and (in some cases) produce organized, actionable responses based on massive data sets that have been collected in the past. As a result, there are a number of possible uses for artificial intelligence that have expanded our collective understanding of what is possible in the age of artificial intelligence.
When these LLMs began to become widely used by the public, however, they started causing a host of problems as soon as they became widely known. The problems began to arise from hallucination (the act of inventing facts, studies, or events in an elaborate manner) as well as inaccurate information provided to the opposing party. Provide accurate (but objectionable, dangerous, or harmful) answers to questions like, "How do I build a bomb?" or "How can I write a program to exploit this vulnerability?"
LLM-based AI systems can be attacked using a variety of attack tactics, and this is true for several different kinds of AI.
A prompt attack can be defined as the act of using prompts to make the model produce answers that, by definition, it should not produce in theory.
The problem with AI models is that they can be backdoored (forced to generate incorrect outputs when triggered) and their training data can be extracted - or poisoned - to generate incorrect outputs.
In situations of adversarial examples, a model can be "confused" with unexpected (but predictable) results due to inputs generated by adversarial examples.
Researchers from Yale and Robust Intelligence have developed a machine learning technique that uses automated adversarial adversarial learning to defeat that last category of attacks by overriding the control structures (“guardrails”) that normally prevent them from achieving success.
There are many LLMs on the market that feature AI to generate useful content at scale, and GPT-4 is one such example. As a result, if these capabilities are not checked, those same capabilities may also be used for harmful purposes. Recent research has led to the development of techniques designed to retool LLMs into malicious systems used to mislead, contaminate, and commit fraud in an attempt to increase their power.
There is also an opportunity for misuse of open-source LLMs that lack safety measures, which can be run automatically on a local machine without any restrictions. In the case of GPT-Neo, for instance, it can be considered a major security risk when it is not used under one's control.
An LLM-based AI system can be attacked to produce results for a variety of reasons, and this is true for many different kinds of AI systems. One such attack is prompting the model with questions which are intended to induce it to produce answered in a way it should not be able to do based on the model's definition.