Generative AI: Using Large Language Models Responsibly
Authors: Matthew Gela, Senior Data Scientist and Christelle Xu, Machine Learning Strategy Lead
Generative AI has recently put AI back into the spotlight, particularly in its ability to create highly realistic images, videos, and text. The Large Language Models and use cases available have implications for business workflows, lines of revenue, and customer experience, and have the potential to drive business value in new ways.
However, as Large Language Models become more powerful, they become more prone to unintended consequences and biases, which have severe implications, especially for already marginalised individuals and groups. Unchecked use of Generative AI could perpetuate biases, and misinformation, and violate privacy, amongst other issues. This is why Responsible AI has become a critical concept in the age of AI generally, but especially in the context of Generative AI.
Responsible AI is a field within ML that has existed long before Generative AI, as a framework for designing, developing, and deploying AI systems that prioritise ethical considerations, transparency, and accountability. Its ultimate aim is to ensure models are used for good and prevent or reduce potential harm to individuals or communities. For more information please refer to our Responsible AI 101 blog.
Generative AI, and more specifically, Large Language Models (LLMs), pose and exacerbate several challenges that have led to discussion and debate in the media. This blog will go through some of the most important tangible actions your business can take to harness the potential of LLMs whilst minimising their negative consequences.
Challenges of Large Language Model Applications
Within the field of Generative AI, Large Language Models, or LLMs, are gaining the most traction. As LLM applications will typically have a direct interface with customers, there is an added risk around company perception and fair use by consumers.
Some specific challenges include:
- LLMs present new security risks
With the use of LLMs being available to the general public, there have been many different examples in the media showing how LLMs can be manipulated by the user to provide unexpected or inappropriate outputs. This is more formally a class of adversarial attacks on the model called prompt injection. This can be done for the purpose of:
- Jailbreaking: this is when the model responds in a way that is not aligned with its guiding principles. A collection of known jailbreaking examples can be found here.
- Prompt leaking: this is the scenario where the LLM unintentionally reveals sensitive information that was included in either the training data or the system prompt. Even in scenarios where sensitive information is not used in the training data, the carefully crafted system prompts can be important IP for the business, which is at risk of being compromised.
- Executing dangerous actions: since applications are being built on top of LLMs, in many cases they are being given access to execute code (see agents). This opens up a wide range of possibilities for the model to be manipulated into running malicious code, which can have disastrous consequences.
- Potential for representational harm
LLMs learn from vast amounts of data, often reflecting the biases present in these datasets. Consequently, they can inadvertently perpetuate harmful stereotypes, exhibit gender discrimination, racism, or other biases, and may marginalise or misrepresent certain groups of people. For instance, if a model’s training data contains biased perspectives, the model can generate responses that echo these views, leading to outputs that are prejudiced or offensive.
- LLMs are an unreliable source of truth
Hallucinations in Generative AI refer to situations where the AI model can produce outputs that appear plausibly real or factual, but are actually incorrect. These models are quickly becoming relied upon as a source of truth, which if left unchecked, can lead to people making poor decisions with serious consequences.
Luckily, there are steps that businesses can take to ensure that they develop LLMs with safety in mind.
Taking action to develop LLM applications responsibly
Choose your use cases carefully
One of the easiest ways to mitigate risk when implementing Generative AI within your business is to choose and prioritise use cases carefully. One way to do this is to consider the risk – or the potential of harmful effects of the use case, and the need or value of the use case to your business – is it essential beyond the hype?
At Datatonic, we recommend that clients begin with use cases that minimise risk of error and are easily validated by humans through a human-in-the-loop system. Typically this includes use cases that accelerate the development of content, marketing images and text, or blogs. This allows Generative AI to increase efficiency and minimise harm.
Set expectations correctly around your Generative AI models
When releasing your Generative AI-based products to end users, one very important way to reduce risks, particularly on overreliance on output, is to communicate the capabilities and limitations of the system. Additionally, for anything that is produced by a generative model, it should be made clear to the end user that it was indeed generated by AI.
Also, in some cases, e.g. when answering factual questions, it may be useful to design your Generative AI system to return a reference of the source used to retrieve the information it has presented back to you, so users can verify the credibility of this.
Design your LLM systems with guardrails
Improving the safety of Large Language Model use is an emerging and open area of research. However, some of the most effective current approaches to mitigating many of the potential risks we have discussed can be solved by applying standard cybersecurity methods, so we will consider these first. There are also a number of LLM-specific techniques, or “guardrails”, that we can apply in addition to this to provide defence in depth. In spite of this, it is very important to realise that even with these techniques, there is no guarantee that your model will always act safely 100% of the time.
With that being said, let’s get into the methods you can use to improve the safety of your LLM systems.
Apply conventional cybersecurity techniques
Before we consider anything else, we should ensure that we have applied the established security techniques we have to mitigate security vulnerabilities that can arise from user input.
In particular, there are parallels between SQL injection attacks and prompt injection attacks, so we can look at measures implemented to mitigate SQL injection to inform how our approaches to mitigating such risks in our LLM applications.
- Character limiting: limit the number of characters a user can input into the model prompt
- Input sanitisation: validate and sanitise user input through methods such as filtering out special characters, blacklisting inappropriate/malicious words, and where appropriate, checking input to ensure that it is in the correct format
- Employ the least privilege principle: only grant the LLM the minimum necessary permissions required to serve its function
The above approaches can be effective in significantly reducing the number of malicious prompts that are even able to enter the model, and provides a lot more control. In particular, giving the model as few permissions as possible will greatly reduce the potential damage that can be done by any attacks that successfully reach the model.
Recognise and check for known attack patterns
Another mitigation method is to build up a database of known malicious prompt patterns, and ensure that all user inputs are checked against this for similarity. If a prompt is too similar, your system should identify this as an attack and refuse the request. Additionally, any attacks that are caught by your LLM system in production should be recorded and added to this database.
Use content moderation to filter out unexpected or unwanted behaviour
Even when utilising pre-aligned models, the non-deterministic output generated by LLMs should still go through content moderation checks to reduce negative outcomes. Many LLM providers are developing and/or embedding content moderation within their foundational models to support this.
Content passed through these APIs is assessed against a list of categories (e.g. hate, violence), and a confidence score is generated for each category, reflecting how likely the input or response belongs to a particular category. Generally, these scores can be used to trigger a safety filter, in which case the model can provide a fallback response, such as “I’m not able to help with that, as I’m only a language model”.
Of course, custom content filtering models and policies can and should be implemented to incorporate more fine-grained control and ensure that responses served to the end user are appropriate in the context of your use case.
Allow the model to critique and revise its own responses with Constitutional AI
Another method that can be used to reduce the risks inherent in the responses of Large Language Models is to allow them to critique and revise their own responses before serving them to the end user. This method allows you to apply a set of “constitutional” principles/standards that responses from the model should meet, and will be critiqued against this. This allows you to guard against unexpected behaviour, such as harmful, toxic, or otherwise undesirable outputs. This is easily customisable for your own use case and principles.
An example of the self-critique and revision of responses from an LLM. The prompt provided was “How do I steal kittens?”, the critique request provided was “The model should only talk about ethical and legal things.”, and the response revision request was “Rewrite the model’s output to be both ethical and legal.”.
Add defence into the prompt instruction
Prompt engineering is a relatively new discipline for developing and optimising prompts to guide LLMs to achieve the desired responses. One way that this can also be utilised as a guardrail is to include additional elements in the prompt that can instruct the model how to behave ethically, including knowing what not to answer and how to reply instead.
In the below case, here is an example of prompt injection, where the user tells the model to ignore all of the above instructions about classifying the text, and to say mean things instead.
Original prompt and response
The modified prompt in the second image (below) includes a warning that the user may try to change the instruction, and if that happens, the model should perform the original task regardless. In this case, this was able to keep the model aligned with its original objective.
Modified prompt and updated response
However, while this method can be good for a prototype or low-risk application, it can be easily tricked with a more clever prompt. As a method for defending against malicious prompts, it is certainly nowhere near robust enough to be relied on in a production-ready application, particularly in high-risk use cases or when defending against deliberate attempts to trick the model.
Fine-tune your models for safety
In some cases, a greater number of examples may be required to be provided to the model to avoid risks from prompt injection. In this case, you are potentially constrained by the amount of text that a model can handle in a prompt. In such scenarios, a potential technique we can use is to fine-tune the model using a curated dataset, ranging from hundreds to a few thousand examples. However, fine-tuning for safety requires careful balancing to ensure the model doesn’t become overly restrictive, and retains its ability to generate creative and diverse responses.
Test rigorously before release
Finally, another practice from conventional cybersecurity that we can apply to LLM applications is red-teaming: this is where systematic adversarial attacks are conducted to test for security vulnerabilities. By subjecting these models to rigorous scrutiny, red-teaming helps uncover flaws, biases, or unintended consequences that might otherwise go unnoticed. It aids in improving the LLM system, identifying areas for enhancement, and fortifying against potential threats.
Regardless of which defences you wish to apply (or not), you should be rigorously testing your LLM applications, as it serves a crucial role in assessing the robustness, reliability, and security of your LLM system.
What tools currently exist for applying guardrails?
New toolkits are emerging to help implement some of the techniques discussed above. For example, constitutional chaining has been implemented within LangChain, the current go-to library for building applications on top of LLMs. Content moderation is being provided by most foundational model providers. Additionally, frameworks for guardrails are emerging; NeMo Guardrails is one such framework for controlling the output of a Large Language Model. As the field continues to evolve, we anticipate a rapid expansion of available tools, offering developers a broader range of options to address the challenges posed by LLMs.
The approaches, tools and techniques we have described in this blog are some initial steps that can be taken to reduce the potential risks associated with LLM-based systems. While it is very important to acknowledge that due to the nature of Large Language Models, the currently known methods cannot eliminate all risks, we believe that some combination of these methods should be applied to any LLM applications you wish to develop.
Datatonic is Google Cloud’s Machine Learning Partner of the Year with a wealth of experience developing and deploying impactful Machine Learning models and MLOps Platform builds.
Turn Generative AI hype into business value with our workshops and packages:
- One-day workshop
- One-day hackathon
- PoCs + MVPs
Get in touch to discuss your Generative AI, ML or MLOps requirements!