
Generative AI + MLOps: Applications in Production


Authors: Alvaro Azabal Favieres, Senior Machine Learning Engineer and Hiru Ranasinghe, Machine Learning Engineer

Generative AI is arguably the most rapidly evolving field in AI and has gained significant attention in recent months due to its ability to produce novel content such as images, music, text, and video. Ever since the release of ChatGPT, the landscape has become highly competitive, with companies such as Google and Meta accelerating their own model development at a previously unheard-of pace. In our recent post, we explained what Generative AI is and how it has the potential to transform every industry. Additionally, we introduced the new capabilities offered by Google to deploy Generative AI use cases in Google Cloud and leverage state-of-the-art models together with its MLOps platform, Vertex AI.

In this blog, we will focus on how to deploy a Generative AI use case into production to maximise its business value. This will be broken down into the following sections:

  • Why Generative AI has introduced a paradigm shift in the world of AI
  • The new concepts that have arisen (namely prompt engineering) and the main challenges they pose
  • The main factors an enterprise should consider when architecting a new use case, with help in assessing the pros and cons of each
  • A deep dive into how to turn a Generative AI prototype into a robust product, and the phases required
  • New capabilities for MLOps platforms to support LLMOps
  • Recommendations and best practices for businesses and developers to achieve the maximum value from Generative AI applications

The Paradigm Shift

In our previous blog, we explained why Generative AI has introduced a new paradigm shift in the field of AI, driven by the power of Foundation Models (FMs). By using one-size-fits-all FMs, there is no longer a 1:1 relationship between building a use case and training a model, as your use case does not necessarily require training a specific model. This drastically reduces the barrier to entry for AI applications in a business. While in the past companies had to gather labelled data and train or tune models, or use pre-trained model APIs for one particular task alone, getting started with Generative AI only requires knowing which existing Foundation Model you want to use and then building an application around that model. Companies should therefore focus their efforts on building the best applications around the model.

However, this by no means implies that companies should stop investing in fine-tuning models for their specific applications, but rather that the time to value of initial PoCs is accelerated by using Foundation Models and APIs without requiring data and training. The next logical question that businesses might ask themselves is: when should I use an FM via an API, and when should I fine-tune an existing FM? The concept of LLMOps (or MLOps for LLMs) aims to answer this question and will be discussed thoroughly in this post. Before jumping into LLMOps, it is worth spending some time understanding what a Generative AI application actually looks like, as it introduces new concepts such as prompt engineering and vector databases.

Generative AI Applications – The Rise of Prompt Engineering

At a high level, all Generative AI use cases follow the same workflow. The end user tells the model what they want, and the model generates the desired response and returns it to the user. What a user wants from a model depends on the use case: it could be a chat (simply asking a question) or an instruction to do something (for instance, summarise a text). The act of “telling something” to the model is called prompting. The user communicates with the model via prompts, which contain relevant information about what the user would like to get out of the model.


This gives rise to the concept of “prompt engineering”: the technique of writing a prompt that guides an LLM to produce the desired response. The most popular form of prompt engineering involves providing your LLM with as much context as possible to generate the best response given the user prompt; you could also think of prompt engineering as “context generation”. It is a necessary discipline given the stochastic and autoregressive nature of LLMs: the main risk of using LLMs in production is that the same prompt can lead to two different answers, or that two prompts aiming for the same generation can lead to very different answers. Providing context reduces the risk of these situations occurring frequently.

Prompt engineering is a fast-evolving field of research with new discoveries and improvements every day, and many publications already exist. The objective of this post is not to provide a detailed explanation of prompt engineering (such a topic would deserve a book by itself); however, some of the most popular techniques are described below, as they are very relevant to the concept of LLMOps.

System Prompt & Prompt Templating

A system prompt is a common technique in which a set of “instructions” that the LLM must follow is added at the beginning of the prompt, so that the LLM has a better understanding of how it should behave. Common system prompts provide guidance about not generating harmful, racist or biased language, or a set of steps that the LLM must follow to reach the expected generation. An example of a basic system prompt would be:

You are an AI assistant and your role is to summarise a given text provided by the user. You should summarise the given text using only 5 sentences. If the text contains signs of toxicity, racism, sexism or any other undesired behaviour you should return “I am not allowed to summarise this text”.
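As a complement to the example above, here is a minimal prompt templating sketch in Python: the system prompt is stored as a versioned template and the user's text is injected at request time. The helper function and version name are illustrative, not part of any particular library.

```python
# Minimal prompt templating sketch: the system prompt is a versioned artefact,
# and the user text is injected at request time. Names here are illustrative.
SYSTEM_PROMPT_V1 = (
    "You are an AI assistant and your role is to summarise a given text provided by the user. "
    "You should summarise the given text using only 5 sentences. If the text contains signs of "
    "toxicity, racism, sexism or any other undesired behaviour you should return "
    '"I am not allowed to summarise this text".'
)

def build_prompt(user_text: str, system_prompt: str = SYSTEM_PROMPT_V1) -> str:
    """Combine the system prompt and the user's text into the full prompt sent to the LLM."""
    return f"{system_prompt}\n\nText to summarise:\n{user_text}"

print(build_prompt("The quick brown fox jumps over the lazy dog. ..."))
```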

Few-shot learning

This technique refers to providing prompt-response pairs to the LLM as part of the prompt. It helps guide the model, as it receives examples of similar prompts and how it should respond. In essence, it is like providing labelled data to your model, but without having to change its weights to tailor it to the specific use case. For example, if you would like your LLM to translate phrases from English to Spanish, you could provide several example translations in the prompt before specifying the phrase to be translated, so that the prompt looks as follows:

You are an AI assistant and your role is to translate phrases from English to Spanish.

For example:

<English> Good morning. I hope you are having a nice day. <Spanish> Buenos días, espero que estés pasando un buen día.

<English> My favourite dish is pizza. <Spanish> Mi plato favorito es pizza.

Chain-of-thought

This technique guides the LLM by including intermediate reasoning steps in the prompt itself, rather than only instructions in a pre-prompt. For example, if you want the model to solve a complex problem, you would write the prompt so that the complex problem is decomposed into smaller, simpler problems, provide worked examples for these “simple” problems, and then join them back together. Mathematical word problems are a classic example, where a complex calculation can be split into a simple sum, then a multiplication, then another sum to reach the final answer.

[Figure: chain-of-thought prompting example. Image Source: Wei et al. (2022)]
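For readers without the figure to hand, here is an illustrative chain-of-thought prompt in the spirit of the Wei et al. (2022) example: the worked answer spells out the intermediate reasoning steps the model is expected to imitate.

```python
# Illustrative chain-of-thought prompt: the first Q/A pair demonstrates the
# reasoning steps; the model is then expected to reason through the second question.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. \
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

print(COT_PROMPT)
```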

Prompt Tuning

Prompt tuning is a method for improving the quality of prompts without changing the model weights, while also not relying on a lot of hand-crafted human engineering. This is achieved by using an AI model itself to design and optimise the prompts via “soft prompts”, where each prompt produced is a learned embedding rather than hand-written text. Recent research has shown this technique to be close in performance to standard fine-tuning while requiring much less computational power.

Embeddings & Vector Databases

While vector databases are not new, their popularity has exploded since the rise of Generative AI. An embedding is a compressed representation of a piece of information as a vector of fixed, smaller dimension. For example, a text document, an image, or a song can be passed through an embeddings model, which outputs a vector of a given dimension that represents the input. Imagine you embed three images: two images of dogs and one image of a chicken sandwich. Each image will be represented as a vector. If you then apply a vector similarity method (for example cosine similarity, or nearest neighbours), the two images of dogs will show a high similarity, whereas the chicken sandwich will be far from the other two images.
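A minimal sketch of that similarity comparison, using NumPy and purely made-up low-dimensional embeddings (real embedding models output hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: values near 1.0 mean very similar, values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for illustration only.
dog_1 = np.array([0.90, 0.10, 0.05, 0.00])
dog_2 = np.array([0.85, 0.15, 0.10, 0.05])
chicken_sandwich = np.array([0.05, 0.20, 0.90, 0.40])

print(cosine_similarity(dog_1, dog_2))             # high similarity (~0.99)
print(cosine_similarity(dog_1, chicken_sandwich))  # much lower similarity (~0.12)
```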

Vector databases can be used to store all the embeddings of a given enterprise (for example, embeddings of all internal documents) and to optimise similarity search over them. From now on, let’s consider the use case of building an internal knowledge base for an enterprise. This is a very popular use case that sets the foundation for countless more complex applications in the future. The end-to-end Generative AI application would work as follows:

  1. Before building the user interface, you first need to embed your documents and populate your vector database. To do this, you split your documents into smaller subsets (i.e. shards), and each shard is embedded via an embeddings model (one of the first popular embedding models was BERT, trained by Google). The resulting vectors are saved in the vector database; Vertex AI Matching Engine is the Google Cloud offering for enterprise-level vector databases at scale. A sketch of the end-to-end flow follows this list.
  2. Then you can build your user interface. During inference, the user submits a query, which is embedded using the same model to produce a query embedding.
  3. This embedding is passed to the vector database, which applies a similarity search technique (for instance k-nearest neighbours) and retrieves the most similar document shards.
  4. The relevant document shards are appended as context to the user query, which is then passed to the generative model.
  5. The model now has much better context and can provide a more accurate answer to the user query. Additionally, the model can cite its sources (i.e. the document shards), which helps increase confidence in the generated output.
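Here is a hedged sketch of that flow in Python. The helpers embed, vector_search and generate are hypothetical placeholders: on Google Cloud they would map to an embeddings model, Vertex AI Matching Engine and an LLM from Vertex AI Model Garden respectively.

```python
def answer_with_context(user_query: str, top_k: int = 5) -> dict:
    """Retrieval-augmented generation sketch; embed/vector_search/generate are hypothetical helpers."""
    query_embedding = embed(user_query)                   # step 2: embed the user query
    shards = vector_search(query_embedding, k=top_k)      # step 3: nearest-neighbour search
    context = "\n\n".join(s["text"] for s in shards)      # step 4: append shards as context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )
    answer = generate(prompt)                             # step 5: grounded generation
    return {"answer": answer, "sources": [s["document_id"] for s in shards]}
```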


Every day, new developments appear in this field to make the query even more concise and optimised, such as contextual compression, which aims to extract only the relevant information from a document. This is by far the most promising prompt engineering method and the one likely to evolve the most in the future. All companies should be considering using a vector database (such as Vertex AI Matching Engine, Milvus, etc.) to embed and store their content.

Challenges with Prompt Engineering

While prompt engineering has already shown impressive results and has led to the paradigm shift that this blog introduces, it has its own new challenges that must be addressed. Many research papers and blogs have been written about these challenges already, so this section will aim to summarise some of these challenges that the industry is starting to agree on.

Sunk Costs When Models Update

Firstly, given that each LLM is different, each LLM requires a different type of prompt engineering. This creates the risk of spending a lot of time and effort optimising a prompt for one model, only to switch to a “better” version of the model and actually get worse responses, because the new model requires a newly engineered prompt. This is a big risk in production applications, where models might be upgraded to new versions (or replaced with entirely new model types).

Lack of robust mathematical definition

Additionally, we have already discussed few-shot learning, which is a very common technique. However, there is no straightforward answer to how many examples to provide, what type of examples to use, or which tasks they should cover. Prompt engineering is better seen as a set of techniques without a robust mathematical theory behind it; optimising solutions through trial and error, with no equation to back you up, makes the task very complex.

Prompt evaluation

Evaluating the quality of a prompt is difficult, and there aren’t well-established methods for this yet. Developers need to ensure that every use case has an unseen, held-out evaluation dataset with expected outputs that can be compared against the generated outputs, just as with classic ML models. When using a Foundation Model, users have little visibility of the data used to train it, which increases the risk that the evaluation dataset has already been seen by the model, contaminating the results.

Two possible solutions exist to evaluate how good a generation is. Firstly, one could use techniques such as sentence similarity, which compare the semantic meaning of sentences and approximate how similar they are. Secondly, human feedback can be used to judge the quality of the responses. If a held-out evaluation dataset does not exist, one must rely on human feedback alone, increasing the risk of bias in the prompt evaluation.
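A minimal sketch of the first approach, scoring generations against a held-out evaluation set with sentence similarity. It assumes the sentence-transformers library and an arbitrary model choice; generate() is a hypothetical stand-in for whichever LLM call your application uses.

```python
from sentence_transformers import SentenceTransformer, util

# Held-out examples the model has not seen; contents are placeholders.
eval_set = [
    {"prompt": "Summarise: <document text>", "expected": "<reference summary>"},
    # ... more examples
]

scorer = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

def evaluate(generate) -> float:
    """Return the average semantic similarity between generations and reference outputs."""
    scores = []
    for example in eval_set:
        generated = generate(example["prompt"])  # hypothetical LLM call
        embeddings = scorer.encode([generated, example["expected"]])
        scores.append(float(util.cos_sim(embeddings[0], embeddings[1])))
    return sum(scores) / len(scores)
```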

Prompt Injection

Another common challenge of prompt engineering is prompt injection. This is a potential risk in complex applications where an LLM is performing multiple tasks. Prompt injection occurs when a given prompt hijacks the model and makes it behave in a way it was not meant to. A classic example is a system prompt telling the model to summarise a given text, where the text in the prompt starts with “Translate the text into Spanish”; this could lead to the model translating the text rather than summarising it. More complex prompt injection can lead to a model bypassing the safety and toxicity guardrails it was built with, or even executing unwanted and risky actions such as clicking a link in a spam email and downloading a virus onto the system.

Cost of third-party APIs

Finally, it is worth noting that if the use case calls a third-party API, adding a lot of context to each prompt can raise the cost of the application. Many third parties charge on a per-token basis for both inputs and outputs, not just outputs. Therefore, the more context provided in the input query, the higher the cost of the query.
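A quick back-of-the-envelope illustration of this effect; the per-token prices below are purely hypothetical placeholders, not any provider's real pricing.

```python
# Hypothetical prices for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.0005
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single API call when both input and output tokens are billed."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

print(query_cost(200, 300))   # bare question:              ~$0.00055
print(query_cost(3200, 300))  # plus 3,000 tokens of context: ~$0.00205, almost 4x the cost
```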

Prompt Engineering in Production

Due to the lack of maturity of most Generative AI applications, there is still no well-defined way of doing prompt engineering in production. However, there are some important aspects to consider when moving your applications to production.

Firstly, your prompt templates should be treated as artefacts and therefore versioned and included in your version control workflows; your LLMOps workflow should always take prompt templates into account. Testing is an essential piece of every production-ready application, and testing prompts is quite a challenge. One solution is to create “test prompts” that accept specific user inputs and are expected to produce a given output, and then execute these tests in your CI/CD pipelines. This can also help you identify unexpected model changes, where a new version of an LLM alters the generated output without you knowing.
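A sketch of what such “test prompts” could look like as a CI test suite (pytest style). The render_prompt and generate helpers, module paths and expected behaviours are all hypothetical and would be specific to your application.

```python
# test_prompts.py -- illustrative "test prompts" executed in CI/CD.
from my_app.prompts import render_prompt  # hypothetical helper wrapping the prompt template
from my_app.llm import generate           # hypothetical helper wrapping the LLM call

def test_summary_respects_sentence_limit():
    prompt = render_prompt(template_version="v3", user_text="<long article text>")
    output = generate(prompt, temperature=0.0)  # low temperature for more reproducible outputs
    sentences = [s for s in output.split(".") if s.strip()]
    assert len(sentences) <= 5

def test_refuses_toxic_input():
    prompt = render_prompt(template_version="v3", user_text="<clearly toxic text>")
    output = generate(prompt, temperature=0.0)
    assert "I am not allowed to summarise this text" in output
```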

One could argue that a Generative AI application is not ready for production until there is a mechanism in place that ensures every output generated by the model is stored, together with the full prompt and the prompt template version used. This is essential for monitoring your application in production.

Finally, managing a vector database in production will also be essential in most cases. Implementing a vector database will require multiple services working together, which makes integration testing essential. 

LLMs in Production

So far, we have discussed prompts and prompt engineering in a lot of detail, but we have barely touched upon a core piece of any Generative AI application: the model. Before doing so, let’s look at the lifecycle of a Generative AI application, and the different phases it can go through in production.

The barrier to entry for Generative AI applications has dropped drastically: a working prototype just needs an LLM API, a prompt and a tool such as LangChain to orchestrate them. However, that is not yet a product. The most valuable metric when building a Generative AI use case is “time to human feedback”: being able to capture what users think of the application and how they interact with it, and building a monitoring component that lets you track that feedback. This monitoring component turns your prototype into a product that you can build new iterations and capabilities on top of.

Let’s consider the knowledge base use case again. A first iteration of the product that includes the monitoring component requires little effort and already brings value: in the context of Google Cloud, the application would be built on Vertex AI Matching Engine, with the monitoring component in BigQuery, and would use an API from Vertex AI Model Garden.
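A minimal sketch of such a monitoring component, writing every interaction (prompt, template version, response and any human feedback) to BigQuery. The table name and schema are assumptions for illustration.

```python
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.genai_monitoring.interactions"  # hypothetical table

def log_interaction(user_query: str, full_prompt: str, template_version: str,
                    model_id: str, response: str, feedback: str | None = None) -> None:
    """Store one model interaction so it can be monitored and reused as feedback data."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_query": user_query,
        "full_prompt": full_prompt,
        "prompt_template_version": template_version,
        "model_id": model_id,
        "response": response,
        "human_feedback": feedback,  # e.g. thumbs up/down captured in the UI
    }
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"Failed to log interaction: {errors}")
```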


The feedback loop is essential for many reasons. It lets you capture the model’s responses and evaluate its performance, and it lets you capture human feedback, which can then be used as labelled data for future model iterations.

The Model

To minimise development time in a proof of concept, we can use an API from a model provider. Another alternative is to fine-tune one of these models by passing labelled data (pairs of prompts and answers selected by humans). The most important question when building new iterations of a Generative AI application is: should I use an API or should I fine-tune a model? The answer depends on multiple factors and, in a field where capabilities evolve as quickly as they do in Generative AI, should be reassessed frequently. There are four possible model hosting options:

  1. Using a third-party API 
  2. Fine-tuning and hosting a model via a third party
  3. Self-hosting an open-source LLM  
  4. Fine-tuning and self-hosting an open-source LLM

Deciding which option to go for depends on multiple factors, mainly:

  • Cost of training
  • Cost of inference 
  • Latency
  • Model accuracy & hallucinations
  • Data protection
  • Effort required to set up, operate & maintain

Every use case will have different requirements. For some, near real-time latency is essential, which might rule out some of the available options. Others might deal with very sensitive data and will need to self-host to guarantee data protection. Other use cases might not have strict requirements and should simply reassess the above factors frequently. Having a product that is modular and flexible enough to let you stay agile and switch model type is therefore an essential requirement.

 

Note: the table below is intended as a qualitative guide for users and is by no means prescriptive. The technology keeps improving every day and some of these statements may become obsolete very quickly.

Training Cost
  • Option #1 (FM & API): No cost.
  • Option #2 (Fine Tune & API): Low cost, varies per provider.
  • Option #3 (Open Source Self-Host): No cost.
  • Option #4 (Fine Tune & Open Source): A one-off cost, higher than all other options.

Inference Cost
  • Option #1: Charged per token. Cost is low if usage is low, but can shoot up rapidly if an application demands many tokens. Providers change their prices continuously; as the market evolves, costs are currently decreasing, suggesting that at some point these models will be cost-efficient at scale.
  • Option #2: Similar to Option #1, but the cost is typically even higher.
  • Option #3: You have more control over the cost as you are in charge of the deployment. The initial cost is typically higher than with APIs, but at high utilisation it may become cheaper.
  • Option #4: As for Option #3: more control over cost, a higher initial cost than APIs, potentially cheaper at high utilisation.

Latency
  • Option #1: Low for most use cases, but not near real-time yet. Many APIs have no latency guarantees or SLAs yet.
  • Option #2: As for Option #1.
  • Option #3: You have more control over latency as you manage the deployment and autoscaling; however, latency is typically not better than with APIs.
  • Option #4: You have the most control over latency, as you can apply techniques such as distillation to improve it.

Model Accuracy
  • Option #1: These models currently have state-of-the-art performance for many generalist applications.
  • Option #2: For very specialised applications, fine-tuning can improve on FM performance when a large number of labelled prompt-answer pairs is provided.
  • Option #3: These models currently have good performance for many generalist applications, but are not yet state-of-the-art.
  • Option #4: For very specialised applications, fine-tuning can improve on FM performance when a large number of labelled prompt-answer pairs is provided.

Data Protection
  • Option #1: Most providers currently give no guarantees around data protection and suggest they could use your data to further train their models, making this a risk for sensitive use cases.
  • Option #2: As for Option #1.
  • Option #3: Self-hosting offers much stronger data protection.
  • Option #4: Self-hosting offers much stronger data protection.

Operationalisation Effort
  • Option #1: Very low; you still need to operate the monitoring and feedback process.
  • Option #2: Very low; you still need to operate the monitoring and feedback process.
  • Option #3: Large, as you need to maintain the deployment.
  • Option #4: Very large, as you need to maintain both the deployment and the fine-tuning of the model, which makes having a good LLMOps setup even more important.

 

MLOps Pipeline vs LLMOps Pipeline

Based on the factors above, some teams might decide that the next version of their product (v2) actually needs fine-tuning, with the solution self-hosted (for instance, deployed to a Vertex AI Endpoint in their Google Cloud project). Doing so introduces the concept of an LLMOps pipeline, which is very similar to a traditional MLOps pipeline. Many (including Datatonic) have already written at length about what an MLOps pipeline looks like, so this post will focus on the new additions and caveats that an LLMOps pipeline introduces.

 


The inference application would now have the user query a model deployed to an endpoint, with the answer stored in a BigQuery table to form the monitoring component. The core of the LLMOps pipeline is a Vertex AI Pipeline which fine-tunes a model. The fine-tuning pipeline does not look much different from any classic ML pipeline, but some of its components deserve additional emphasis:

Labelled Data Collection

Without labelled data, it is impossible to fine-tune any model. Whereas in traditional ML you typically have the data before you start training a model, in Generative AI use cases you may not need labelled data until much later in your product lifecycle. A challenge with collecting this data is knowing which subset of data to use (i.e. for which specific tasks you are going to provide examples) and how much data to provide. The current rule of thumb is that fine-tuning provides benefits when more than 100 labelled examples are provided, but this may vary depending on the model.
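A minimal sketch of collecting these labelled prompt/answer pairs as a JSONL file, a format commonly expected by tuning services (including Vertex AI's supervised tuning). The exact field names ("input_text"/"output_text") are an assumption to be checked against your provider's documentation.

```python
import json

# Human-curated prompt/answer pairs; contents are placeholders.
examples = [
    {"input_text": "Summarise: <document text>", "output_text": "<human-written summary>"},
    # ... aim for at least ~100 examples per task before expecting gains from fine-tuning
]

with open("tuning_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```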

Fine-Tuning

This is the central component of the pipeline: here you tailor the model to the specific application your use case requires. Traditional fine-tuning takes pairs of prompts and answers as labelled data and tunes the model weights based on these answers. The rise of GPT-like models has brought new concepts to fine-tuning, which fall under three tuning types: supervised learning (traditional tuning), reward-based tuning and Reinforcement Learning from Human Feedback (RLHF). The latter two methods are fundamental to ChatGPT and similar models and depend on human feedback assessing different model responses. Your Vertex pipeline could therefore fine-tune the model in any of these three ways (or all of them), depending on the data available.

Latency Optimisation

For most Generative AI use cases, latency is essential. These are huge models that typically do not have optimal latency because of their size and complexity. Introducing techniques in your pipeline to automatically optimise your model is therefore essential for a short time to value, which is the core principle of MLOps and LLMOps. Examples of such techniques are distillation and quantisation.

Model Evaluation

Any ML pipeline should include a component to evaluate model performance. However, as detailed earlier in this post, evaluating generative LLMs is not trivial, and no clear metric or method has been established yet. Your pipeline should contain a component that applies a set of metrics (e.g. sentence similarity, semantic meaning, SuperGLUE, etc.) consistently across different model types and versions, to allow for a more robust and accountable comparison. When someone decides to fine-tune their model, one of the first steps should always be to prepare a good evaluation dataset and select the evaluation metrics to use.

Model Validation for Responsible AI

All LLMs should be trained and deployed with at least some principles of Responsible AI in mind. Your Vertex pipeline should therefore always contain a Kubeflow component that generates and publishes a model card, and that evaluates the model against bias and fairness metrics such as safety, racism, sexism and other undesirable behaviours. Creating a data card to evaluate the training data is also an important aspect of a responsible model.

Continuous Monitoring & Continuous Feedback

MLOps gave rise to the popularity of Continuous Monitoring and Continuous Training; LLMOps should do the same with Continuous Feedback. An LLMOps pipeline will maximise its business value if it is set up so that human interactions are continuously fed back into the model’s training data and the model is retrained on that data. It is therefore essential to be able to evaluate the prompts provided by users, the generated answers and any feedback provided by humans, and to route that feedback back into the labelled data to continuously train the model.

Future iterations of an LLMOps pipeline and a Generative AI application should build upon these components and add new functionality such as support for A/B testing, additional monitoring, or further integration with other tooling. These features will depend on the use case needs. 

In any case, whether or not your second and subsequent versions require fine-tuning a new model and collecting data, your application will most likely always rely on prompt engineering too. The question to ask yourself is therefore not whether you need fine-tuning or prompt engineering, but whether you need fine-tuning to specialise your model or can rely on a generalist, one-size-fits-all Foundation Model.

LLMOps Platforms

Many companies have spent recent years building or adopting MLOps platforms to train, deploy and monitor ML models in production at scale, and might be wondering how LLMOps fits into that platform. LLMs bring additional challenges and considerations, but many MLOps platform capabilities are still relevant to LLMs, such as model registries, experiment tracking and model monitoring. Companies should invest in updating their existing MLOps platforms to include new functionalities specific to Generative AI use cases. The next section will highlight some of the most relevant new capabilities that an LLMOps Platform should include to maximise the business value of Generative AI applications. 

Experimentation

MLOps platforms allow you to experiment with different model hyperparameters and track the results. Popular services and frameworks include MLFlow, Vertex AI Experiments or Weights & Biases. Similar functionality should be added to an LLMOps platform to let users quickly experiment with different prompt templates and different LLMs. These experiments must be tracked and the responses must be logged for traceability to be able to reuse them if needed. The Generative AI Studio in Vertex AI is Google’s first attempt at experiment tracking for LLMs, allowing users to quickly try out new prompt templates, add context and iterate through different models. For full traceability, this functionality should be integrated with Git and/or logging systems such as Google Cloud Logging. 
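A hedged sketch of what tracking prompt-template experiments could look like using the Vertex AI Experiments API from the Python SDK; the run, parameter and metric names are illustrative assumptions.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west2",
                experiment="prompt-template-experiments")

# One run per prompt template / model combination being tried.
aiplatform.start_run("summarisation-template-v3")
aiplatform.log_params({
    "model": "text-bison@001",
    "prompt_template_version": "v3",
    "temperature": 0.2,
})
aiplatform.log_metrics({
    "avg_sentence_similarity": 0.81,  # from the evaluation approach described earlier
    "avg_latency_seconds": 1.4,
})
aiplatform.end_run()
```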

Prompt Registry

Once users have experimented with different prompt templates and models for a given task, they should have a way of registering and versioning the resulting prompt. Prompt registries play this role in much the same way as model registries do for trained ML models: LLMOps platforms should store and version these prompts, so that different use cases can reuse them directly, saving time and effort for developers. Additional labelling and metadata management should be included so that prompts are treated like any other artefact in the platform, such as models or datasets.
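A purely illustrative, in-memory sketch of such a prompt registry; a real platform would back this with a database or Git rather than a Python dict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PromptVersion:
    name: str
    version: int
    template: str
    metadata: dict = field(default_factory=dict)
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class PromptRegistry:
    """Minimal registry: prompts are versioned artefacts with metadata, like models or datasets."""

    def __init__(self) -> None:
        self._prompts: dict[str, list[PromptVersion]] = {}

    def register(self, name: str, template: str, metadata: dict | None = None) -> PromptVersion:
        versions = self._prompts.setdefault(name, [])
        entry = PromptVersion(name, len(versions) + 1, template, metadata or {})
        versions.append(entry)
        return entry

    def get(self, name: str, version: int | None = None) -> PromptVersion:
        versions = self._prompts[name]
        return versions[-1] if version is None else versions[version - 1]

registry = PromptRegistry()
registry.register("summarise-en", "You are an AI assistant...\n{user_text}",
                  metadata={"task": "summarisation", "owner": "ml-team"})
print(registry.get("summarise-en").version)  # 1
```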

RLHF Tuning Pipeline

Many MLOps platforms have templated versions of training and prediction pipelines that let users rapidly train and deploy models to production and accelerate development time. RLHF fine-tuning is essential for domain-specific LLMs, so platforms should extend their pipeline templates with a specific pipeline that collects human feedback, processes it to fine-tune a model, and deploys that model to production. This pipeline should include the components mentioned in the previous section, where the differences between MLOps and LLMOps pipelines were discussed.

Use Case Monitoring

Monitoring a model in production is essential to ensure that performance does not degrade, and to allow you to act quickly when it does. While traditional ML use cases rely on monitoring data drift and training-serving skew, LLMs require a different approach. Platforms should make it possible to monitor the use cases themselves and how users interact with the LLM within them. A requirement for this is to log, capture and save all human interactions and LLM responses, as well as the human feedback throughout these interactions. This could raise data privacy concerns, so every attempt to log human interactions should abide by GDPR and similar data privacy regulations. An LLMOps platform should simplify how human feedback is captured.

Once human feedback and human-model interactions are captured, platforms should enable various monitoring capabilities. While drift and skew are the metrics to monitor in an MLOps platform, LLMs have no clear monitoring metric defined yet. Some important values to monitor and track over time are listed below; a sketch of computing a few of them follows the list:

  • Evolution of positive/negative human feedback
  • Length of responses generated 
  • Number of back-and-forth chat interactions between user and LLM
  • Sentiment analysis of user prompts in chat. Capturing the sentiment (or tone) of the human prompt during a conversation can be used as a proxy metric for how well the model is performing
  • Number of safety attributes being triggered. As an example, the Google PaLM API allows you to identify up to 17 different safety attributes and how strongly each response exhibits each of them.
  • Additional guardrails being raised by the model, to monitor how well the implemented guardrails are performing
  • Latency of responses
  • Number of characters passed as input in the prompt and generated as output. This can then be used as a proxy to measure the cost of use cases in production whenever using APIs and can help monitor the business value of every use case
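A sketch of computing a few of these metrics from captured interaction records; the field names assume the kind of monitoring schema described earlier and are illustrative.

```python
def monitoring_summary(records: list[dict]) -> dict:
    """Aggregate a few LLM monitoring metrics from logged interaction records."""
    n = len(records)
    return {
        "positive_feedback_rate": sum(r.get("human_feedback") == "positive" for r in records) / n,
        "avg_response_length_chars": sum(len(r["response"]) for r in records) / n,
        "avg_prompt_length_chars": sum(len(r["full_prompt"]) for r in records) / n,
        "avg_latency_seconds": sum(r["latency_seconds"] for r in records) / n,
        "safety_trigger_rate": sum(r.get("safety_attributes_triggered", 0) > 0 for r in records) / n,
    }
```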

Vector Database

A centralised vector database should be the backbone of an LLMOps platform, augmenting the capability of LLMs and improving context setting. Centralising the database reduces the cost and duplication of embeddings and minimises its management and maintenance overhead. For proper use in an LLMOps platform, the vector database should have the following features:

  • Infrastructure-as-Code deployments
  • Automatic document ingestion, either on a schedule or automatically triggered when new documents are made available
  • Optional PII data handling to remove sensitive data from the vector database. This is particularly important if use cases select third-party LLM APIs

Conclusion & Recommendations

Throughout this blog we have explained why Generative AI introduced a paradigm shift in the world of AI, and why it reduces the barrier to entry of AI use cases considerably by offering the possibility of using third-party models via an API. This has given rise to prompt engineering, which is a fast-evolving field that consists of coming up with techniques to provide more context to the model and build a better prompt. Multiple techniques exist for this, including the creation of a vector database to hold embeddings of all documents and information. This technique allows you to both improve the quality of the prompt and obtain the sources of information used by the model to generate the answer, thus increasing the confidence in the generation. However, prompt engineering comes with some challenges and new techniques must be implemented to manage these in production.

Overall, the Generative AI product lifecycle can be summarised as shown in the Figure below. An initial product can be quickly built by leveraging the use of an API. However, to maximise the value of your product, you must include a monitoring component to collect human feedback and interactions with the model as soon as possible. Whether your use case requires additional labelling, fine-tuning or product integration will depend on the specific application, and these factors should be assessed frequently. Some of the key factors to analyse are training and inference cost, latency, reliability, data privacy and effort to operate. 

If you decide to fine-tune and host your own model, this introduces new challenges addressed by LLMOps around data collection and labelling, fine-tuning, latency optimisation, model evaluation and validation, and finally the concept of Continuous Feedback. Extending existing MLOps platforms to support LLMs requires developing new capabilities, but aims to increase the business value of Generative AI models and to accelerate the time to production and ease the maintenance of these use cases.

As a final recommendation: Generative AI is such a fast-evolving field, with new additions and improvements in the landscape every day (prompt engineering, vector databases, LLMs, etc.), that it is difficult to be very prescriptive. For now, the two most important things to keep in mind are:

  • Your application should be built as modularly as possible, with little dependency on any specific tool or model. When a new model or a new technology appears in the market, the winners will be those agile enough to switch to it as fast as possible
  • This blog has introduced some of the main factors to consider when deciding whether to fine-tune a model or not. Given the dynamics of the industry, these trade-offs need to be reassessed constantly and metrics recalculated frequently. The most important metric to consider is “time to human feedback”, which then supports all of the other metrics. Human feedback and human validation should therefore be included even in the very early stages of your product.

 

Datatonic is Google Cloud’s Machine Learning Partner of the Year with a wealth of experience developing and deploying impactful Machine Learning models and MLOps Platform builds. 

Turn Generative AI hype into business value with our workshops and packages:

  • One-day workshop
  • One-day hackathon
  • PoCs + MVPs

Get in touch to discuss your Generative AI, ML or MLOps requirements!

 
