dbt is a great tool for data transformation and is popular with both open-source communities and enterprise-grade users. dbt’s core version is open-source and highly adaptable to all main Cloud data warehouse solutions, such as Google BigQuery and Snowflake. This flexibility and versatility allow for some very cool implementations for solving data management challenges.
dbt has powerful features to deal with data management of different configurations such as incremental models, snapshots, etc. Within incremental configurations, there are further options such as setting on_schema_change properties to handle table schema changes.
However, depending on your use case, sometimes a full refresh of models is necessary. A common challenge faced by dbt users is the issue of conducting a full refresh, also known as backfilling in production.
In this blog, we will discuss scenarios where a full refresh is necessary when there has been either a schema or code change for a model and how we can give the CI/CD process near-complete autonomy by implementing capabilities to organically handle backfills.
Our aim is to create a near-100% autonomous and intelligent CI/CD process for the following scenario and assumptions:
1. We are deploying in Google Cloud, using BigQuery for data storage.
2. We are using the Command Line Interface (CLI) version of dbt.
3. We have all the necessary infrastructure and security provisions in place for CI/CD processes and dbt workflow execution, including any and all Service Account (SA) requirements.
4. For most incremental analytical models, we allow the CD process to run backfills only for the models where there has been either a schema or code change. The reasons for this are:
5. The models which are out-of-scope for this automation are configured as full_refresh = false. Other mechanisms, such as specific tags, can also be used to ensure non-idempotent and/or large loads are not part of any “blunt instrument” automated backfill.
Using the dbt “state” method is the most preferable choice to create this automated backfill strategy into the CD process (see Understanding dbt “state” method).
Throughout this process, we used the following:
For this example, we are using both a UAT and production environment with a CI/CD strategy in line with our branch workflow (any merged code to UAT and production only when a release tag is created and per naming convention (e.g., contains “release”))
One of the key inspirations for solving our challenge was to understand dbt Labs’ “Slim CI” solution.
The fully-managed version of dbt, dbt Cloud, offers Slim CI. Slim CI provides a simpler, more efficient, and cost-effective way to test new and modified dbt models before deploying them into production. By surfacing the job’s most recent successfully-run artefacts, dbt Cloud determines the set of new and modified resources. Thus, we can choose to run only the dbt models for which any code has changed.
For further reading, have a look at Understanding dbt Cloud Slim CI.
The key methods and techniques behind this process enabled our solution to handle backfills for models where there has been a change and it is determined that the table should be rebuilt and backfilled:
It was vital for us to review the challenge and decide on the best approach to branch workflow for our solution.
For our solution, the branch workflow has a protected master branch, and each new feature is developed in short-lived feature branches, which are merged back into the Master branch before being deleted.
Here is an outline of the process:
In the case of our deployment, “Slim Backfills” are a workflow designed to create maximum automation in the CI/CD process by targeting a “full refresh” of models that have been identified as “modified” when changes are merged to the master branch.
This is established by creating a “360-process” in CI/CD where the previous version of the manifest file is stored in a GCS bucket and later used to compare models/nodes in dbt to determine — using the dbt “state” method — which models have been changed and require backfilling. The design thus allows a very high degree of autonomy in the process of CI/CD.
Here are the steps we followed in the order of implementation:
gsutil cp gs://manifest-location-prod/manifest.json manifest.json
dbt run --full-refresh --select state:modified+ --state .
We decided to update the production workflow so that it should save all the artefacts in the dbt target folder to a GCS bucket as illustrated in the process diagram below:
This process now implements the first requirement for our target solution to build “slim backfill” into the CD process. It stores the manifest.json as part of the process which saves all of the dbt target folder content.
After implementing a process which stores the artefacts, including those from the currently-deployed version of our dbt solution, we can now implement the full “360-process” for creating near autonomy to the CD process, as shown in the diagram below:
By creating a “360-process”, we can now ensure that the continuous deployment process has near-complete autonomy by being able to handle backfills/model rebuild & reload in a targeted and smart way. This approach, on top of bringing a high degree of autonomy, also makes the CD process very efficient and cost-effective, with its highly targeted nature.
Whilst it’s great to have a highly independent solution for CI/CD on paper, there are some key reservations to keep in mind when implementing a solution which gives so much power to an autonomous framework:
1. This solution isn’t “one size fits all”; a full refresh for models done autonomously may not be ideal for some organisations and scenarios, especially if models involve critical business requirements. Some examples include:
2. In some cases, load idempotency is an issue; you should either set full refresh to false or not use an automated backfill framework for CD.
3. Generally, automated table rebuilds can be perceived as quite risky for Data Governance and Integrity.
Some of these can be mitigated by either disabling backfills for critical models or creating a bit more of a workflow around the slim backfills themselves such as using tags or selectors to only backfill or exclude from backfill certain groups of models. That’s one of the great things about dbt — there are many ways to solve challenges around data transformation solutions!
Datatonic, 4x Google Cloud Partner of the year, and Google Cloud’s Machine Learning Partner of the Year, is the leading cloud data + AI consultancy for the world’s most ambitious brands, challenging traditions to deliver tangible innovation at the leading edge of Google Cloud. We have a wide range of experience with dbt. Contact us to find out more.
Know exactly where and how to start your AI journey with Datatonic’s
three-week AI Innovation Jumpstart *.
* Duration dependent on data complexity and use case chosen for POC model
With your own data sets, convince your business of the value of migrating your data warehouse, data lake and/or streaming platform to the cloud in four weeks.
With your own data, see how Looker can modernise your BI needs
with Datatonic’s two-week Showcase.