Deploying Powerful Machine Learning Models on Google Cloud
In the last blog of this series, Machine Learning Engineer Hiru Ranasinghe covered how we developed a Natural Language Processing (NLP) model with Hugging Face to generate abstracts for COVID-19 research papers. Now that the model has been developed, it is time to deploy it.
In this blog, we will show you how to deploy Machine Learning (ML) models on Google Cloud and serve predictions from them via an API, using our NLP model as an example. Once deployed, the model can be integrated with a front end (e.g., a website), such as our COVID-19 Text Summariser.
We will walk through every step needed to take saved model files, like ours from the first blog in this series, through to deployment. After reading, you should be able to take an ML model from development and host it on the internet using Google Cloud for members of the public or your business to use.
Background
Before we get into how to deploy your model, it is worth looking at the bigger picture of the steps and services that a prediction request will go through before reaching the deployed model. Google Cloud has many services that have similar functions, so it is useful to understand when to choose each one.
Before the model is deployed, Cloud Build uses the model serving source code to build the serving image. This image is then uploaded to Artifact Registry and, finally, deployed on Cloud Run. Cloud Run is Google Cloud's managed service for running container images.
When a user wants to get predictions from the model, they make a prediction request via the frontend (e.g., a website). This request is sent from the frontend to API Gateway; we'll discuss this in more detail below. API Gateway then forwards the request to the Cloud Run instance running the model serving image, which generates a prediction and sends it back to API Gateway. The response is returned to the frontend, where it is displayed to the user.
Google Cloud Services
Below, we outline the stages in developing and deploying an ML model along with the rationale for the chosen services. Note that the creation and management of these Google Cloud services can be automated with Terraform.
Model Development: Vertex AI + Google Cloud Storage
Vertex AI is Google’s unified AI platform, which has several features for performing tasks across the ML lifecycle. In particular, Vertex AI offers managed Jupyter notebooks. These can be used for model development in a cloud environment, and training jobs using more powerful hardware can be run to quickly train models. In addition, trained model artifacts can be stored in Google Cloud Storage as the API for fetching these artifacts is easy to use; a single command can be used to fetch the artifacts when the container image is built.
Model development within a cloud environment has several advantages:
- Faster download and upload speeds by using Google’s network
- Lower computational load for local machines
- Swift integration with Google’s other services
- Range of hardware for different ML tasks
Model Deployment: Cloud Run + API Gateway
Deploying models with Google Cloud Run is very simple. Container images can contain code written in any language; all they need to do is run an HTTP server. Vertex AI endpoints were another consideration, but they impose strict requirements on container images. Cloud Run, on the other hand, is completely flexible and imposes no such restrictions.
Another benefit of using Cloud Run instead of Vertex AI endpoints is that Cloud Run can scale down to zero instances. In contrast, Vertex AI endpoints must always have a virtual machine running, which means a charge is incurred even when the endpoint receives no requests.
One downside of using Cloud Run is that every time the deployed container changes, its URL changes too. Therefore, in order to preserve the URL to which the model requests are sent, we use API Gateway in front of the Cloud Run instance.
Technical Implementation
The end goal is to build an application that runs an HTTP server which has routes to handle prediction requests. To achieve this, Uvicorn is used to run the HTTP server and FastAPI is used to define the routes of the application.
Our model deployment folder has the following structure:
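The exact layout will vary from project to project; an illustrative sketch (any file names beyond those discussed below are assumptions) is:

```
model_deployment/
├── Dockerfile
├── .dockerignore
├── cloudbuild/
│   └── container_build.yaml
├── helpers.py
├── main.py
├── pyproject.toml
├── poetry.lock
└── generate_api_spec.sh
```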
For now, we will focus on the following:
- Dockerfile – defines the serving container image
- cloudbuild – directory with the Cloud Build configuration to build and push the container image
- helpers.py – defines model inference logic
- main.py – defines the HTTP server and its routes for handling predictions and health checks; this is run by Uvicorn when the server starts
- pyproject.toml – used by Poetry to manage Python dependencies
Below are the steps we’ll take to deploy our model:
- Setting up dependency management with Poetry (Optional)
- Using FastAPI to run an HTTP server with developer-defined routes
- Writing the Dockerfile to define the model serving image
- Building and pushing the model serving image to a registry using Cloud Build
- Deploying the model serving image on Cloud Run
- Setting up the API Gateway
- Testing the model
1. Setting up Dependency Management with Poetry (Optional)
This is an optional step for setting up Poetry, a dependency management tool. A requirements.txt can be used instead, but using Poetry is more scalable and promotes best practices. Poetry automates the creation of a virtual environment to install dependencies into. It also resolves dependency versions that work together and locks these down into a separate file. This way, dependencies don’t need to be resolved by pip each time packages are installed.
First, install Poetry onto your machine:
‘pip install poetry’
In the root directory, initialise Poetry to create a pyproject.toml:
‘poetry init’
In the pyproject.toml, add in package dependencies under [tool.poetry.dependencies]. Read more about how to do this here.
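As an illustration, the dependency section might look something like this (the packages listed and their version ranges are placeholders, not our exact dependencies):

```toml
[tool.poetry.dependencies]
python = "^3.9"
fastapi = "^0.95"
uvicorn = "^0.22"
transformers = "^4.28"
torch = "^2.0"
```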
Install the packages into Poetry’s virtual environment:
‘poetry install’
This step can take a while depending on the number of package dependencies. Once dependency management is set up, it is time to start defining the HTTP server using FastAPI.
2. Using FastAPI to Run an HTTP Server with Developer-Defined Routes
In this step, we will explain how to write the main.py to handle prediction requests. We will assume that you have already created the basic Python function that generates predictions from user input (in our case, helpers.py) and can simply import and call it as part of a route.
Install FastAPI and Uvicorn if you haven’t already:
If using pip: ‘pip install fastapi uvicorn’
If you’ve set up dependency management with Poetry: ‘poetry add fastapi’ ‘poetry add uvicorn’
Imports
The imports from the FastAPI library are used to generate the API schema, which API Gateway uses to handle Cross-Origin Resource Sharing (CORS) issues. ‘BaseModel’ (from Pydantic) is a class used to form the base of request and response objects. It hides the complexity of raw HTTP requests and responses and lets you simply define what data you expect to receive and what data you want to send back. The ‘List’ type (from Python’s typing module) is then used when defining the request and response data.
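A minimal set of imports for main.py might look like this (the exact imports are an assumption based on the description above):

```python
from typing import List

from fastapi import FastAPI
from fastapi.openapi.utils import get_openapi  # used later to generate the API schema
from pydantic import BaseModel

from helpers import gen_summary  # our model inference function
```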
Requests and Responses
Using the ‘BaseModel’ class, we can define the structure of the HTTP requests the model expects and the structure of the HTTP responses that are sent back.
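For example, a summarisation request carrying one or more input texts, and a response carrying the generated summaries, could be defined as follows (the field names are assumptions):

```python
class PredictionRequest(BaseModel):
    # One or more pieces of text to summarise
    instances: List[str]


class PredictionResponse(BaseModel):
    # One generated summary per input instance
    predictions: List[str]
```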
Routes
Here, we define the functions that run when requests are sent to specific routes. ‘@app.get’ and ‘@app.post’ are decorators that tell our application to run these functions when a GET or POST request is sent to the specified route.
The ‘/predictions’ route takes the input from the user’s request, uses it to call the ‘gen_summary’ function, and returns the prediction. The ‘/api/healthz’ route is used to check whether the deployed container is healthy.
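Putting this together, the routes might look like the sketch below; the route paths match those mentioned above, while the function bodies are illustrative:

```python
app = FastAPI(title="COVID-19 Text Summariser")


@app.get("/api/healthz")
def health_check():
    # Simple health check to confirm the deployed container is serving
    return {"status": "ok"}


@app.post("/predictions", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    # Call the model inference logic defined in helpers.py for each input text
    summaries = [gen_summary(text) for text in request.instances]
    return PredictionResponse(predictions=summaries)
```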
API Schema
API Gateway requires a schema written in the OpenAPI specification. This spec is a YAML file and can be created using the function we imported earlier. The ‘x-google-backend’ key added to the YAML sets the configuration that API Gateway will use, such as the Cloud Run URL and the timeout for requests. These two variables are set via our Terraform code, which automatically builds this infrastructure.
The final block of code is used so that the API Gateway can handle CORS issues.
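As a rough sketch, a schema-generation function placed alongside the routes in main.py could look like this; the environment variable names are assumptions (in our setup they are populated by Terraform), and the CORS-specific entries are omitted:

```python
import json
import os


def generate_api_schema() -> str:
    # Build the OpenAPI schema from the FastAPI app defined above
    schema = get_openapi(title=app.title, version="1.0.0", routes=app.routes)

    # Configuration API Gateway uses to reach the Cloud Run backend
    schema["x-google-backend"] = {
        "address": os.environ["CLOUD_RUN_URL"],
        "deadline": float(os.environ.get("REQUEST_TIMEOUT", "300")),
    }
    return json.dumps(schema, indent=2)
```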
Local Testing
To test the server locally, run ‘uvicorn main:app --host 0.0.0.0 --port 7080’. This runs the server on your local machine at port 7080. For full documentation on Uvicorn, see here. A test request can then be sent using the ‘curl’ Unix command or a tool like Postman.
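For example, assuming the request model sketched earlier, a test request with ‘curl’ might look like this:

```bash
curl -X POST "http://localhost:7080/predictions" \
  -H "Content-Type: application/json" \
  -d '{"instances": ["Text of a COVID-19 research paper to summarise..."]}'
```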
Once the server works, it is time to build the container image that runs this server!
3. Writing the Dockerfile to Define the Model Serving Image
Since we are using Google Cloud Build, any local files in the root directory are temporarily uploaded to Google Cloud Storage so they can be used in COPY statements. However, uploading the model files for every build would be very time-consuming, as they amount to several gigabytes. Therefore, we use a multi-stage build, splitting the build into two steps:
- Copy the model files from Google Cloud Storage into an intermediate builder container image
- Create the serving container image, installing dependencies and copying model files from the intermediate builder container image.
This is done in two steps to make the serving container image as small as possible. For example, the serving container image doesn’t need the Google Cloud SDK. Also, the intermediate builder container image is discarded after the Cloud Build job is complete. The first stage of the build is defined like so in the Dockerfile:
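A minimal sketch of this first stage, with a placeholder bucket path, might look like the following:

```dockerfile
# Stage 1: fetch the model artifacts from Google Cloud Storage (bucket path is a placeholder)
FROM google/cloud-sdk:alpine AS model-builder

RUN mkdir -p /model
RUN gsutil -m cp -r gs://your-bucket/model-artifacts/* /model/
```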
The Dockerfile’s first step uses a lightweight (alpine) tag of the Google Cloud SDK base image, makes a directory, and copies the model files from Google Cloud Storage into it.
The Dockerfile’s second step creates the actual serving container image. It uses the Python slim base image and installs Poetry (optional) into it. We then copy the Poetry files, or requirements.txt, from the local folder into the container image and install the dependencies.
With all the dependencies installed, the other project files are copied into the serving container image. A .dockerignore file can be created to ignore specific files in the copy. We then copy the model files from the first stage into the serving container image.
Lastly, we set environment variables used by the model and run the Uvicorn server that listens for requests.
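A sketch of this second stage, assuming Poetry with a committed poetry.lock (the paths and environment variable names are assumptions), might look like this:

```dockerfile
# Stage 2: the serving image
FROM python:3.9-slim

WORKDIR /app

# Install Poetry (optional) and the project dependencies
RUN pip install --no-cache-dir poetry
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.create false && poetry install --no-interaction --no-ansi

# Copy the application code (a .dockerignore keeps unwanted files out of this COPY)
COPY . .

# Copy the model artifacts from the first stage
COPY --from=model-builder /model /app/model

# Environment variable read by the model inference code (name is an assumption)
ENV MODEL_DIR=/app/model

# Start the Uvicorn server that listens for prediction requests
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7080"]
```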
4. Building and Pushing the Model Serving Image to a Registry Using Cloud Build
Now that the model serving image is defined, we need to actually build it and push it to a registry. This is done using Cloud Build. The instructions for Cloud Build are saved in cloudbuild/container_build.yaml. The exact instructions will depend on how you manage container images, but the Cloud Build documentation explains how to write basic configuration files.
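A basic configuration along these lines, with a placeholder image path and region, could look like this:

```yaml
# cloudbuild/container_build.yaml -- illustrative sketch
steps:
  - name: "gcr.io/cloud-builders/docker"
    args:
      - "build"
      - "-t"
      - "europe-west2-docker.pkg.dev/$PROJECT_ID/model-serving/summariser:latest"
      - "."
images:
  - "europe-west2-docker.pkg.dev/$PROJECT_ID/model-serving/summariser:latest"
```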
Run ‘gcloud builds submit --project=${GC_PROJECT_NAME} --config=cloudbuild/container_build.yaml --timeout=3600’ to start the build process.
The timeout flag controls how long Cloud Build waits, in seconds, before cancelling the job. The default is 600 seconds (10 minutes), which is too short given the large model files. When using large model files (>100 MB), we recommend extending the timeout to 3600 seconds (1 hour) to ensure the build job has time to complete.
5. Deploying the Model Serving Container Image on Cloud Run
Once the container image has been pushed to the registry (Artifact Registry in our case), it can easily be deployed onto Cloud Run.
First, navigate to Cloud Run in your project and click “CREATE SERVICE”.
Ensure the “Deploy one revision from an existing container image” option is chosen and select your container image, a name for the deployed model service, and a region.
For CPU allocation, we chose to only allocate during request processing. This means that billing is done on a per-request basis. In terms of autoscaling, we recommend setting the minimum instance number to at least one, to reduce cold starts.
In terms of security, there are two settings. The Ingress option can be configured to only allow requests from within a Virtual Private Cloud (VPC) network (regardless of the user’s IAM permissions). The Authentication option determines whether requests are allowed from anyone or only from authorised users with the right IAM permissions. We set Ingress to “Allow all traffic” because we are not using a VPC network. Authentication is set to “Require authentication” so that only our middleware and authorised internal users can send requests directly to the model.
Lastly, we set the container to listen for requests on port 7080 as configured in our Dockerfile and set the capacity of the machine.
Once set, feel free to click “Create” and wait a few minutes for the instance to spin up!
6. Setting up the API Gateway
Lastly, we set up the API Gateway which sits in front of our Cloud Run instance.
Generating the Spec
First, we need to actually generate the spec mentioned in the previous section. This is done using a Node.js package: the FastAPI function generates the spec as JSON, and we use the converter to turn it into YAML.
If you don’t have Node.js installed, see here. Once it is installed, navigate to the root directory of your project and install the api-spec-converter package:
‘npm install -g api-spec-converter’
We then define a shell script which generates the spec using this package:
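As a sketch, assuming the hypothetical generate_api_schema function from the API Schema section above (file and path names are assumptions):

```bash
#!/usr/bin/env bash
# generate_api_spec.sh -- illustrative sketch
set -euo pipefail

# Dump the FastAPI-generated schema as JSON using the function from main.py
python -c "from main import generate_api_schema; print(generate_api_schema())" > openapi_v3.json

# Convert it to a Swagger 2.0 YAML spec, which API Gateway expects
api-spec-converter --from=openapi_3 --to=swagger_2 --syntax=yaml openapi_v3.json > api_spec.yaml
```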
The redirect at the end of the script declares where to output the YAML file. Run the script to generate the spec.
When not using Terraform, the Cloud Run URL will need to be manually input into the spec and will need to be changed each time the Cloud Run instance is changed.
Creating the API Gateway in Google Cloud
An API Gateway in Google Cloud sits within an API object. This API object has a configuration object (which uses our previously generated spec) and a gateway object. To set up an API Gateway, navigate to API Gateway within the Google Cloud Console, then click “Create Gateway”.
API
Choose a display name and ID for the API object.
API Config
Upload the generated API Spec and choose a display name for the config.
Gateway Details
Set a display name for the gateway as well as its location; it is best to choose the region closest to you.
7. Testing the Model
Send a request to the model using cURL or Postman, similar to what we did when testing it locally. The main difference this time is that the URL will be the API Gateway’s URL with ‘/predictions’ at the end. If everything is successful, a prediction will be returned!
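For example, with a placeholder gateway URL and the request shape sketched earlier:

```bash
curl -X POST "https://your-gateway-url.gateway.dev/predictions" \
  -H "Content-Type: application/json" \
  -d '{"instances": ["Text of a COVID-19 research paper to summarise..."]}'
```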
Congratulations! By following these steps, you should have successfully deployed your model and will now be able to send requests and receive responses from it!
Datatonic are Google Cloud’s Machine Learning Partner of the Year with a wealth of experience developing and deploying impactful Machine Learning models and MLOps Platform builds. Need help with developing an ML model, or deploying your Machine Learning models fast? Have a look at our MLOps 101 webinar, where our experts talk you through how to get started with Machine Learning at scale, or get in touch to discuss your ML or MLOps requirements!