Adding Near-Complete Autonomy to dbt Continuous Deployment Using “Slim Backfills”
Author: Ash Sultan, Analytics Engineering Lead
dbt is a great tool for data transformation and is popular with both open-source communities and enterprise-grade users. dbt’s core version is open-source and highly adaptable to all main Cloud data warehouse solutions, such as Google BigQuery and Snowflake. This flexibility and versatility allow for some very cool implementations for solving data management challenges.
dbt has powerful features to deal with data management of different configurations such as incremental models, snapshots, etc. Within incremental configurations, there are further options such as setting on_schema_change properties to handle table schema changes.
However, depending on your use case, sometimes a full refresh of models is necessary. A common challenge faced by dbt users is the issue of conducting a full refresh, also known as backfilling in production.
In this blog, we will discuss scenarios where a full refresh is necessary when there has been either a schema or code change for a model and how we can give the CI/CD process near-complete autonomy by implementing capabilities to organically handle backfills.
Background & Assumptions
Our aim is to create a near-100% autonomous and intelligent CI/CD process for the following scenario and assumptions:
1. We are deploying in Google Cloud, using BigQuery for data storage.
2. We are using the Command Line Interface (CLI) version of dbt.
3. We have all the necessary infrastructure and security provisions in place for CI/CD processes and dbt workflow execution, including any and all Service Account (SA) requirements.
4. For most incremental analytical models, we allow the CD process to run backfills only for the models where there has been either a schema or code change. The reasons for this are:
- Most upstream tables in the raw layer are already in existence
- Data was originally available but business requirements excluded certain fields from being pulled into the analytical layer
- Using the dbt on_schema_change config for incremental models isn’t sufficient because they do not deal with historical data
5. The models which are out-of-scope for this automation are configured as full_refresh = false. Other mechanisms, such as specific tags, can also be used to ensure non-idempotent and/or large loads are not part of any “blunt instrument” automated backfill.
Using the dbt “state” method is the most preferable choice to create this automated backfill strategy into the CD process (see Understanding dbt “state” method).
Key Architecture, Products & Services Used
Throughout this process, we used the following:
- Datatonic’s DOP framework on Cloud Composer to run the dbt CLI in Google Cloud
- Datatonic’s enterprise data warehouse (EDW/DWH), a “2-tier” data architecture that includes a DWH layer and an Analytical layer, with an additional “staging” layer which acts as a transition layer between raw and DWH or DWH to analytical layers.
- For data transformation & management, our solution implements a workflow in productions and runs a set of dbt tasks by using dbt tags (for further reading, see dbt tags) per the illustration below:
- GitHub for source control and CI/CD
For this example, we are using both a UAT and production environment with a CI/CD strategy in line with our branch workflow (any merged code to UAT and production only when a release tag is created and per naming convention (e.g., contains “release”))
Inspiration From Slim CI
One of the key inspirations for solving our challenge was to understand dbt Labs’ “Slim CI” solution.
The fully-managed version of dbt, dbt Cloud, offers Slim CI. Slim CI provides a simpler, more efficient, and cost-effective way to test new and modified dbt models before deploying them into production. By surfacing the job’s most recent successfully-run artefacts, dbt Cloud determines the set of new and modified resources. Thus, we can choose to run only the dbt models for which any code has changed.
For further reading, have a look at Understanding dbt Cloud Slim CI.
The key methods and techniques behind this process enabled our solution to handle backfills for models where there has been a change and it is determined that the table should be rebuilt and backfilled:
- Using Slim CI for surfacing and comparing previous run artefacts to the latest ones, specifically: manifest.json
- Using dbt “state” methods and leveraging the power of dbt to create a targeted framework for a run
Our Git Strategy
It was vital for us to review the challenge and decide on the best approach to branch workflow for our solution.
For our solution, the branch workflow has a protected master branch, and each new feature is developed in short-lived feature branches, which are merged back into the Master branch before being deleted.
Here is an outline of the process:
- We have a master/main branch which contains the most stable release of our code.
- Each person creates a feature branch off the master branch to do their work in.
- Once each feature is complete, it is then merged into the master branch.
- Feature branches are used to develop single features only, and will be deleted once complete.
- Releases are created by tagging the master branch.
Our Solution: Continuous Deployment With “Slim Backfills”
In the case of our deployment, “Slim Backfills” are a workflow designed to create maximum automation in the CI/CD process by targeting a “full refresh” of models that have been identified as “modified” when changes are merged to the master branch.
This is established by creating a “360-process” in CI/CD where the previous version of the manifest file is stored in a GCS bucket and later used to compare models/nodes in dbt to determine — using the dbt “state” method — which models have been changed and require backfilling. The design thus allows a very high degree of autonomy in the process of CI/CD.
Here are the steps we followed in the order of implementation:
- For the deployed production workflow of the dbt run, save the manifest.jsonfile to a GCS store. This file should contain the state of the deployed models as of the currently deployed version (of a data warehouse or analytical project, etc.)
- For the CI/CD pipeline, create a set of processes that will set off a chain of events when a new deployment is triggered.
- Retrieve the previous manifest.json from the GCS bucket, e.g., using a gsutil command like:
gsutil cp gs://manifest-location-prod/manifest.json manifest.json
- Use the dbt “state” method to compare the new version (default) of the manifest.json and compare to the retrieved version to determine the models where there has been a change, either to the schema or the code
- Run the command for the slim backfill; a basic version of this command will be:
dbt run --full-refresh --select state:modified+ --state .
- Once this has been completed, continue with an incremental/regular run per the original deployment and post-deployment processes
Update The Production dbt Workflow
We decided to update the production workflow so that it should save all the artefacts in the dbt target folder to a GCS bucket as illustrated in the process diagram below:
This process now implements the first requirement for our target solution to build “slim backfill” into the CD process. It stores the manifest.json as part of the process which saves all of the dbt target folder content.
Implementation of Slim Backfill to CD Process
After implementing a process which stores the artefacts, including those from the currently-deployed version of our dbt solution, we can now implement the full “360-process” for creating near autonomy to the CD process, as shown in the diagram below:
By creating a “360-process”, we can now ensure that the continuous deployment process has near-complete autonomy by being able to handle backfills/model rebuild & reload in a targeted and smart way. This approach, on top of bringing a high degree of autonomy, also makes the CD process very efficient and cost-effective, with its highly targeted nature.
Whilst it’s great to have a highly independent solution for CI/CD on paper, there are some key reservations to keep in mind when implementing a solution which gives so much power to an autonomous framework:
1. This solution isn’t “one size fits all”; a full refresh for models done autonomously may not be ideal for some organisations and scenarios, especially if models involve critical business requirements. Some examples include:
- Key customer or business data
- Very large models that may take many hours to rebuild
2. In some cases, load idempotency is an issue; you should either set full refresh to false or not use an automated backfill framework for CD.
3. Generally, automated table rebuilds can be perceived as quite risky for Data Governance and Integrity.
Some of these can be mitigated by either disabling backfills for critical models or creating a bit more of a workflow around the slim backfills themselves such as using tags or selectors to only backfill or exclude from backfill certain groups of models. That’s one of the great things about dbt — there are many ways to solve challenges around data transformation solutions!
Datatonic, 4x Google Cloud Partner of the year, and Google Cloud’s Machine Learning Partner of the Year, is the leading cloud data + AI consultancy for the world’s most ambitious brands, challenging traditions to deliver tangible innovation at the leading edge of Google Cloud. We have a wide range of experience with dbt. Contact us to find out more.