MLOps Tools Part 3: Cloud Functions vs. Cloud Run for Model Serving
Parts one and two of our MLOps tools series gave you an insight into data transformation tools (TensorFlow Transform vs. BigQuery) and orchestration tools (Kubeflow Pipelines vs. Cloud Composer). In this post, we look at model serving on Google Cloud, with a deep dive into two common options: Cloud Functions and Cloud Run. We discuss reasons for using each tool and highlight our experiences of them in practice.
Before we dive into these two tools, we first need to understand the different ways that you can obtain predictions from a machine learning model. Once a model has been created and saved, there are two options for using it:
- make predictions on blocks of historical data already stored somewhere;
- or make predictions on brand-new, live data coming in.
The first type is known as offline predictions, and is common when there isn’t a strict latency requirement on predictions (for example, providing propensity scores to help align marketing campaigns). Here, what matters most is that the model can handle large blocks of request data arriving all at once.
The second type is known as online predictions, and is much more common when models are used in a real-time scenario (for example, providing real-time content recommendations on a website). The challenges here are more extensive: handling fluctuating request volumes, supplying responses as quickly as possible, and ensuring minimal or no downtime when deploying updated models.
For this blogpost, we will discuss Cloud Functions and Cloud Run in the context of serving online requests on Google Cloud Platform.
Cloud Functions (see more here) is Google Cloud’s FaaS (Functions as a Service) offering, allowing you to deploy ad-hoc functions on Google Cloud Platform without worrying about the underlying infrastructure. There are several ways of triggering these functions; for model serving, users typically wrap the model in a simple Python function, which accepts data from an HTTP request and returns model predictions.
A typical serving architecture might look like the following:
Here, an HTTP request is received, which triggers a Python function to be executed. The function takes JSON data from the request and passes it to the model, and the model's predictions are returned in the response to the HTTP request.
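As a rough illustration of this flow, the sketch below shows what such a function might look like. It assumes the Python runtime passes a Flask-style request object to the function; the `DummyModel` class, the `instances` and `predictions` field names and the function name `predict` are hypothetical placeholders rather than anything prescribed by Cloud Functions.

```python
import json


class DummyModel:
    """Hypothetical stand-in for a real model loaded from storage."""

    def predict(self, instances):
        # Placeholder logic: "predict" the sum of each feature vector.
        return [float(sum(row)) for row in instances]


# Created once at import time, so warm invocations reuse the same object.
MODEL = DummyModel()


def predict(request):
    """HTTP entry point; `request` is assumed to be a Flask-style request."""
    payload = request.get_json(silent=True)
    headers = {"Content-Type": "application/json"}
    if not payload or "instances" not in payload:
        body = json.dumps({"error": "expected JSON with an 'instances' list"})
        return body, 400, headers
    predictions = MODEL.predict(payload["instances"])
    return json.dumps({"predictions": predictions}), 200, headers
```

Loading the model at module level rather than inside the function is a common choice here, since it avoids reloading the model on every request served by an already running instance.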
Reasons to use Cloud Functions
- Ease of deployment: Cloud Functions are a very simple way to deploy models. Users just need to write a Python script containing the model they want to deploy, with no additional overhead required. Cloud Functions provides the rest of the framework for you, generating an HTTP endpoint for requests. Functions are quick to deploy and easy to update, and allow easy configuration of ingress, egress and authentication. Cloud Run requires much more engineering knowledge: users need to be familiar with the concepts of containers, and an understanding of Kubernetes also helps. The model itself also needs to be wrapped in an additional serving framework (such as TensorFlow Serving) before it can be containerised, adding further complexity.
- Cheap, simple-to-manage resources: With Cloud Functions, you only pay for the number of invocations and the resources used while a request is being handled, making it a cheap option for deployment. You do not pay for idle server time, as resources scale down when not in use. Cloud Functions even provides a perpetual free tier for both computing resources and invocations (2 million invocations, 400,000 GB-seconds and 200,000 GHz-seconds of compute time, and 5 GB of Internet egress traffic free per month), allowing you to experiment with model deployment at no charge. Whilst the fully managed version of Cloud Run has a similar free tier and pricing system, you have to pay more attention to managing resources when deploying via Cloud Run into a Kubernetes cluster in Google Cloud or on-prem. In this case, Cloud Run will only scale the number of pods on a node down to zero, but not the number of nodes in the cluster.
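To get a feel for the free tier figures quoted above, the following sketch estimates whether a monthly workload stays within them. The allowances are the ones listed in this post; the example workload (one million calls of 200 ms each on a function with 0.25 GB of memory and 0.4 GHz of CPU) is purely an illustrative assumption, and egress traffic is left out for brevity.

```python
# Free tier allowances per month, as quoted in this post.
FREE_INVOCATIONS = 2_000_000
FREE_GB_SECONDS = 400_000
FREE_GHZ_SECONDS = 200_000


def monthly_usage(invocations, ms_per_call, memory_gb, cpu_ghz):
    """Estimate monthly usage from call volume and the function's size."""
    total_seconds = invocations * ms_per_call / 1000
    return {
        "invocations": invocations,
        "gb_seconds": total_seconds * memory_gb,
        "ghz_seconds": total_seconds * cpu_ghz,
    }


def fits_free_tier(usage):
    """True if every metered quantity is within the free allowance."""
    return (
        usage["invocations"] <= FREE_INVOCATIONS
        and usage["gb_seconds"] <= FREE_GB_SECONDS
        and usage["ghz_seconds"] <= FREE_GHZ_SECONDS
    )


# Assumed workload: 1M calls, 200 ms each, 0.25 GB memory, 0.4 GHz CPU.
usage = monthly_usage(1_000_000, 200, 0.25, 0.4)
print(usage, fits_free_tier(usage))
```

Under these assumptions the workload uses roughly 50,000 GB-seconds and 80,000 GHz-seconds, so it would sit comfortably inside the free tier.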
Cloud Functions was the serving tool of choice in our recent engagement with News UK. Its ease of use made it simple to integrate with the client’s existing APIs, whilst still performing under strict latency requirements. It was also very simple to deploy our chosen model architecture with Cloud Functions, and the generated endpoint performed well under stress testing simulating the expected number of requests in production.
Cloud Run (see more here) is a managed version of the open source project Knative on Google Kubernetes Engine. It allows you to easily serve models that have been deployed in a container, without needing to worry about the underlying compute infrastructure. If you already have an existing Kubernetes cluster, you can choose to serve a model in that cluster (using Cloud Run for Anthos); otherwise, Cloud Run will deploy the service in a Kubernetes cluster managed by Google Cloud.
Now, when an HTTP request is made, it is routed to a particular machine within a Kubernetes cluster. That machine runs the container holding the model, which has been pulled from Google Container Registry and deployed via Cloud Run. The model accepts the data sent to the container, and its predictions are returned in the response to the HTTP request.
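To make the container side more concrete, here is a minimal sketch of a serving process that could be packaged into a container image. It uses only the Python standard library instead of a dedicated serving framework such as TensorFlow Serving, and the `predict` stub and the `instances` and `predictions` field names are hypothetical. It assumes the port to listen on is supplied through the `PORT` environment variable, falling back to 8080.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(instances):
    """Hypothetical stand-in for a real model."""
    return [float(sum(row)) for row in instances]


def build_response(raw_body):
    """Turn a raw request body into (status_code, response_bytes)."""
    try:
        payload = json.loads(raw_body or b"{}")
        body, status = {"predictions": predict(payload["instances"])}, 200
    except (ValueError, KeyError, TypeError):
        body, status = {"error": "expected JSON with an 'instances' list"}, 400
    return status, json.dumps(body).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, data = build_response(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port=None):
    """Create the HTTP server; the port defaults to the PORT variable."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), PredictHandler)
```

The container's entry point would then call `make_server().serve_forever()`. In practice a production-grade server or serving framework would replace `http.server`, but the request and response shape stays the same.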
Reasons to use Cloud Run
- Model freedom: A container-based approach removes any restrictions on the choice of model, as you can use any custom system packages, libraries, and languages to your heart’s content. Versioning of models is also straightforward via versioning in Container Registry. With Cloud Functions there are a limited number of system packages made available to the execution environment. Whilst the Python runtime accepts pip-based Python libraries or local packaged dependencies, this still doesn’t cover all the options.
- Portability: With Cloud Run, the container can be deployed fully managed, into an existing Kubernetes cluster on Google Cloud Platform, or even into an existing cluster on-prem (with the help of Cloud Run for Anthos). Cloud Run is also available in slightly more regions (21 regions compared to 19 regions for Cloud Functions).
- CPU configuration: Even when using the fully managed version of Cloud Run, there are still several options for customising the compute power available to the containers when making predictions: the CPU allocated to each container; the RAM allocated to each container; and the maximum number of concurrent requests a single container can serve. There is a memory limit of 4GB on the size of a function deployed via Cloud Functions, meaning that this tool may not be suitable for significantly larger models. In addition, the amount of CPU available scales with the memory (up to 4.8GHz), so the CPU allocation cannot be adjusted independently.
- Traffic management: Cloud Run includes features to manage the traffic between containers (Canary rollouts and A/B Testing) as well as to limit traffic to authorised users or to requests incoming from a particular network. It is also possible to have multiple endpoints for a model; for example, one for batched requests and one for single requests. Cloud Functions has options to restrict traffic, but not to manage or redirect traffic. Each Cloud Functions instance has only one endpoint, and can only serve one request at a time.
Cloud Run was the clear suggestion for a recent project we undertook on fraud detection with a world leader in payments processing. The ability of Cloud Run to manage traffic between containers means it would be perfect for implementing a Champion/Challenger model infrastructure on a Kubernetes cluster which the client could maintain. Moreover, with Cloud Run for Anthos, there is also the ability to utilise GPUs in serving on GCP, allowing us to use model frameworks and libraries specifically optimised for GPU resource consumption. Finally, the client wanted more control over the scaling of the serving infrastructure, in order to easily handle the extreme volume of transactions they process.
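The idea behind a percentage-based traffic split, as used for canary rollouts or a Champion/Challenger setup, can be sketched in a few lines. This is only a simulation of the concept, not how Cloud Run implements routing; the revision names and the 90/10 split are hypothetical.

```python
import random
from collections import Counter


def pick_revision(weights, rng):
    """Choose a revision name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical split: 90% of requests to the champion, 10% to the challenger.
split = {"champion": 90, "challenger": 10}
rng = random.Random(0)  # Seeded so the simulation is repeatable.

counts = Counter(pick_revision(split, rng) for _ in range(10_000))
print(counts)
```

Over many requests the share of traffic reaching each revision approaches the configured percentages, which is what allows a challenger model to be evaluated on a small slice of live traffic before it is promoted.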
Overall, Cloud Functions is a useful tool to quickly deploy a simple model, whilst making use of a typical data scientist’s existing skillset. If you have more engineering expertise at hand, need to make use of custom binaries and libraries, or need a more complex serving architecture that integrates with your own infrastructure, then we strongly recommend Cloud Run as the more suitable option.
Cost-wise, Cloud Functions and Cloud Run are comparable when handling a small number of single requests synchronously. However, when your request load increases, and you need to handle requests concurrently, Cloud Run quickly becomes the cheaper and faster option.
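A back-of-the-envelope calculation shows why concurrency matters so much here. The number of requests in flight at any moment is roughly the request rate multiplied by the time each request takes, and each instance can absorb only as many of those as its concurrency allows. In the sketch below, the load of 500 requests per second, the 200 ms latency and the Cloud Run concurrency of 80 are illustrative assumptions; the concurrency of 1 reflects the one-request-per-instance behaviour of Cloud Functions described above.

```python
import math


def instances_needed(requests_per_second, latency_ms, concurrency):
    """Estimate how many instances are needed to keep up with the load."""
    in_flight = requests_per_second * latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrency))


# Assumed load: 500 requests per second, each taking 200 ms to serve.
one_at_a_time = instances_needed(500, 200, concurrency=1)
concurrent = instances_needed(500, 200, concurrency=80)
print(one_at_a_time, concurrent)  # 100 2
```

Under these assumptions, serving one request per instance needs around 100 instances, whereas instances that each handle 80 concurrent requests need only two, provided a single instance has enough CPU and memory to sustain that concurrency.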
We hope you enjoyed our MLOps tool comparison blog for Model Serving. In the next part, we look at different tools for monitoring model performance on Google Cloud.