Author: Steve Mudute-Ndumbe, Machine Learning Engineer
Welcome to the last post of our blog series on MLOps on Google Cloud Platform!
In previous weeks we have released blogs covering Data Transformation, Orchestration, Serving, and Monitoring. In this final part of the series, we will look at implementing a feature store on Google Cloud Platform (GCP) in two ways: using a combination of BigQuery and Memorystore; and using the open-source tool FEAST.
Feature stores are often an overlooked part of the MLOps journey, and yet their role can be pivotal in operationalising machine learning. A feature store can be best described as the interface between the raw data and the model, or between a data engineer and a data scientist. It allows data engineers to ingest raw data into a central repository ready for data scientists to experiment on, and allows data scientists to consistently create and retrieve features for various model use cases. However, a good feature store is much more than just storing feature data: serving features consistently, monitoring feature performance, storing feature metadata, and even managing the data transformation and ingestion processes are also of key importance.
In this diagram we can see the core components of a feature store split into multiple sections. Whilst storage is normally the main focus, there are also transformation, serving, operational monitoring, and a registry to think about.
In addition, the storage layer of a feature store is normally split into an offline store and an online store. The offline store contains the full archive of feature data, which can be used for model training or for batch predictions. The online store serves the latest known value of each feature with minimal latency (see our third post, “Cloud Functions vs. Cloud Run for Model Serving”, for more on online serving on Google Cloud Platform).
Feature stores are still a novel idea to a lot of teams, with implementations still in their infancy. We will now explore two different ways of implementing a feature store on Google Cloud Platform.
We previously introduced BigQuery in the first post, “TensorFlow Transform vs. BigQuery for Data Transformation”. There it was viewed as a tool for data transformation, and we focused more on its compute capabilities. BigQuery also offers traditional RDBMS-style storage in the form of datasets and tables which can easily be queried with SQL, making it an ideal choice for the offline store component of a feature store.
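To make this concrete, here is a minimal sketch of how a training query against a BigQuery-backed offline store might be built. The project, dataset, table, and column names (`my-project.feature_store.driver_features`, `driver_id`, `event_timestamp`) are hypothetical placeholders, not names from a real project; the key idea is filtering on an event timestamp so that no future data leaks into the training set.

```python
def build_training_query(table, entity_key, features, cutoff):
    """Build a SQL query fetching historical feature values up to a
    training cutoff timestamp, to avoid leaking future data."""
    cols = ", ".join([entity_key, "event_timestamp"] + features)
    return (
        f"SELECT {cols} "
        f"FROM `{table}` "
        f"WHERE event_timestamp <= TIMESTAMP('{cutoff}')"
    )

# Hypothetical table and feature names, for illustration only:
query = build_training_query(
    table="my-project.feature_store.driver_features",
    entity_key="driver_id",
    features=["trips_today", "avg_rating"],
    cutoff="2021-01-01 00:00:00",
)
print(query)
```

In practice this string would be passed to the BigQuery client, and the cutoff would typically be applied per-entity for true point-in-time correctness, but the shape of the query is the same.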
Meanwhile, Memorystore (see more here) is Google’s managed implementation of the open-source key-value database Redis, which offers in-memory caching of non-relational data. An in-memory database is a good choice for an online store, as it enables extremely low-latency feature lookups. Redis also has the advantage of being able to store different types of values depending on the feature, optimising retrieval for data types such as Strings, Lists, and Hashes.
Combining both BigQuery and Memorystore thus gives us a managed offline and online store on Google Cloud Platform.
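The dual-store pattern can be sketched in plain Python: every ingested feature row is appended to the offline store (standing in for a BigQuery table), while the online store (standing in for a Memorystore/Redis cache, using Redis-style string keys) keeps only the latest value per entity for low-latency lookups. All class and key names here are illustrative, not a real BigQuery or Redis API.

```python
class FeatureStore:
    """Toy dual-store: full history offline, latest values online."""

    def __init__(self):
        self.offline = []   # append-only history, for training / batch scoring
        self.online = {}    # latest value per "entity:feature" key

    def ingest(self, entity_id, feature, value, timestamp):
        # Offline store keeps every row ever ingested.
        self.offline.append({"entity_id": entity_id, "feature": feature,
                             "value": value, "timestamp": timestamp})
        # Online store keeps only the most recent value, keyed Redis-style.
        key = f"{entity_id}:{feature}"
        current = self.online.get(key)
        if current is None or timestamp >= current["timestamp"]:
            self.online[key] = {"value": value, "timestamp": timestamp}

    def get_online(self, entity_id, feature):
        """Low-latency lookup of the latest known feature value."""
        entry = self.online.get(f"{entity_id}:{feature}")
        return None if entry is None else entry["value"]

store = FeatureStore()
store.ingest("driver_42", "trips_today", 3, "2021-01-01T09:00:00")
store.ingest("driver_42", "trips_today", 5, "2021-01-01T12:00:00")
print(store.get_online("driver_42", "trips_today"))  # latest value: 5
print(len(store.offline))                            # full history kept: 2
```

In a real deployment, `ingest` would write to BigQuery and issue a Redis `SET`, and the serving layer would only ever touch the online half.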
FEAST (see more here) is an open source feature store tool, which was developed by Google Cloud and GO-JEK and released in 2019. It combines both an offline and online store into one unified tool.
On top of offline and online storage, FEAST also offers a feature registry, allowing users to create, share, and reuse features through a centralised platform, as well as a dedicated Python SDK for interacting with the store. FEAST stores features using the entity data model:
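The entity data model can be illustrated with a plain-Python sketch: an entity (e.g. a driver) is identified by a join key, and features are grouped into tables registered against that entity. The class and field names below are simplified stand-ins for the idea, not the actual FEAST SDK, whose API has changed across releases.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str        # e.g. "driver"
    join_key: str    # column used to join feature data, e.g. "driver_id"

@dataclass
class Feature:
    name: str        # e.g. "trips_today"
    dtype: str       # e.g. "INT64"

@dataclass
class FeatureTable:
    name: str
    entity: Entity
    features: list = field(default_factory=list)

# Hypothetical driver entity with a grouped set of features:
driver = Entity(name="driver", join_key="driver_id")
driver_stats = FeatureTable(
    name="driver_stats",
    entity=driver,
    features=[Feature("trips_today", "INT64"),
              Feature("avg_rating", "FLOAT")],
)

# The hierarchy makes it clear which features belong to which entity:
print(driver_stats.entity.name)                  # driver
print([f.name for f in driver_stats.features])   # ['trips_today', 'avg_rating']
```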
Using this structure allows users to impose a strict hierarchy and organisation of features, making it clear which features are linked to which entities and where they come from.
Whilst BigQuery combined with Memorystore will cover your storage needs and provide the lowest online latency, a dedicated feature store needs much more functionality than storage alone.
We would recommend using BigQuery + Memorystore if you want a simple managed offering of an offline and online store backed by strict SLAs. With FEAST comes more organisation and streamlined tooling, but at the cost of deploying and managing the tool yourself. However, both offerings still lack some desired functionality, especially with regard to orchestrating the data transformations before ingestion (see the second post, “Kubeflow Pipelines vs. Cloud Composer for Orchestration”, for more on orchestration tools).
Google Cloud Platform recently announced that they will be adding a managed feature store service to AI Platform, which will look to provide many of the additional options that a dedicated feature store requires. With word of other companies also producing managed feature store offerings, we expect plenty of development around this area of MLOps in 2021 and beyond.
Thank you for following this blog series – we hope it will help you to refine your ML journey moving forward. We’ll be releasing plenty more content in 2021 (including whitepapers as well as blog posts), so stay tuned!
In the meantime, check out our 2021 Guide to MLOps for information on getting started!